About this primer
Ever since the first genome-wide association study (GWAS) on age-related macular degeneration, and the promise of personalized medicine in the wake of the Human Genome Project, large-scale genetic association studies hold significant sway in contemporary health research and drive drug-development pipelines. In the past 2 decades, researchers delved into GWAS, aiming to unveil genetic variations linked to both human traits, such as the color of your eyes, and rare and common complex diseases. These findings serve as crucial keys to unravel the intricate mechanisms underlying diseases, shedding light on whether the correlations identified in observational studies between risk factors and diseases are truly causal.
These studies have ushered in an exciting era where many researchers thrive on developing new methods and bioinformatic tools to parse ever-growing large datasets collected large population-based biobanks. However, the analyses of these data are challenging and it can be daunting to see the forest for tree among the many tools and their various functions. Enter A Practical Primer in Human Complex Genetics. This GitBook was originally written back in 2022 for the Genetic Epidemiology course organized by the Master Epidemiology of Utrecht University. This practical guide will teach you how to design a GWAS, perform quality control (QC), execute the actual analyses, annotate the GWAS results, and perform further downstream post-GWAS analyses. Throughout the book you’ll work with ‘dummy’, that is fake, data, but in the end, we will use real-world data from the first release of the Welcome Trust Case-Control Consortium (WTCCC) focusing on coronary artery disease (CAD).
A major component of modern-day GWAS is genetic imputation, but for practical reasons it is not part of this book. However, I will provide some pointers as to how to go about do this with minimal coding or scripting experience. Likewise, the courses does not cover the aspects of meta-analyses of GWAS, but some excellent resources exist to which I will direct. As this practical primer evolves, these and other topics may find their place in this book.
I should also point out that emphasis of this book is on it being a practical primer. It is intended to provide some practical guidance to doing GWAS, and while theory is important, I will not cover this. Again, some very useful and excellent work exists to which I will point you, but I really want you to learn - and understand the theory - by doing.
So, although originally crafted as a companion for the course, this practical guide stands on its own as a comprehensive resource for diving into all facets of doing a GWAS — save for experimental follow-up, of course 😉.
I can imagine this seems overwhelming, but trust me, you’ll be okay. Just follow this practical. You’ll learn by doing and at the end of the day, you can execute a GWAS independently.
Ready to start?
Steps in a Genome-Wide Association Study
Now that you understand a bit of the navigation in Unix-systems, you’re ready to start this practical primer. We will make use of a dummy dataset containing cases and controls. We will explain and execute the following steps:
- convert raw data to a more memory-efficient format
- apply extensive quality control on samples and SNPs
- assess the ancestral background of your study population
- perform association testing
- visualize association results
Converting datasets
The format in which genotype data are returned to investigators varies among genome-wide SNP platforms and genotyping centers. Usually genotypes have been called by a genotyping center and returned in the standard PED and MAP file formats designed for PLINK.
A PED file is a white space (space or tab)-delimited file in which each line represents one individual and the first six columns are mandatory and in the following order:
- ‘Family ID’,
- ‘Individual ID’,
- ‘Paternal ID’,
- ‘Maternal ID’,
- ‘Sex (1=male, 2=female, 0=missing)’, and
- ‘Phenotype (1=unaffected, 2=affected, 0=missing)’.
The subsequent columns denote genotypes that can be any character (e.g., 1, 2, 3, 4 or A, C, G, T). Zero denotes a missing genotype. Each SNP must have two alleles (i.e., both alleles are either present or absent).
The order of SNPs in the PED file is given in the MAP file, in which each line denotes a single marker and the four white-space–separated columns are chromosome (1–22, X, Y or 0 for unplaced), marker name (typically an rs number), genetic distance in Morgans (this can be fixed to 0) and base-pair position (bp units).
Let’s start by using PLINK to converting the datasets to a lighter, binary form (a .bed-file). This file saves data in a more memory- and time-efficient manner (in a ‘binary’-format) to facilitate the analysis of large-scale data sets (Purcell S. 2007). The marker-information is stored in the .bim-file and the family information in the .fam-file. PLINK creates a .log file (named raw-GWA-data.log) that details (among other information) the implemented commands, the number of cases and controls in the input files, any excluded data and the genotyping rate in the remaining data. This file is very useful for checking whether the software is successfully completing commands.
Make sure you are in the right directory. Do you remember how to get there?
cd ~/Desktop/practical
Next, we’ll make a project directory.
mkdir -v ~/Desktop/practical/dummy_project
Now, we’ll convert the .ped/.map files to the binary-format.
plink --file rawdata/raw-GWA-data --make-bed --out dummy_project/rawdata
Let’s review the .log-file for a bit. It should look something like this:
PLINK v1.90b7.2 64-bit (11 Dec 2023) www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to dummy_project/rawdata.log.
Options in effect:
--file rawdata/raw-GWA-data
--make-bed
--out dummy_project/rawdata
16384 MB RAM detected; reserving 8192 MB for main workspace.
.ped scan complete (for binary autoconversion).
Performing single-pass .bed write (317503 variants, 2000 people).
--file: dummy_project/rawdata-temporary.bed +
dummy_project/rawdata-temporary.bim + dummy_project/rawdata-temporary.fam
written.
317503 variants loaded from .bim file.
2000 people (997 males, 1003 females) loaded from .fam.
2000 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 2000 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Warning: 11440 het. haploid genotypes present (see dummy_project/rawdata.hh );
many commands treat these as missing.
Total genotyping rate is 0.985682.
317503 variants and 2000 people pass filters and QC.
Among remaining phenotypes, 1023 are cases and 977 are controls.
--make-bed to dummy_project/rawdata.bed + dummy_project/rawdata.bim +
dummy_project/rawdata.fam ... done.
So, there are 317,503 variants included for 2,000 people, 997 males and 1,003 females. All of these individuals are ‘founders’. There are 1,023 cases and 977 controls. The genotyping rate is about 98.6% which is pretty good. Lastly, there are 11,440 heterozygous haploid genotypes present.
Question: Can you think off what the ‘11,440 heterozygous haploid genotypes present’ represent?
Quality control
We are ready for some quality control and quality assurance, heavily inspired by Anderson et al. (Anderson C. A. 2010) and Laurie et al. (Laurie C. C. 2010). In general, we should check out a couple of things regarding the data quality on two levels:
- samples
- variants
So, we will investigate the following:
- Are the sexes based on genetic data matching the ones given by the phenotype file?
- Identify individuals that are outliers in terms of missing data (call rate) or heterozygosity rates. This could indicate a genotyping error or sample swap.
- Identify duplicated or related individuals.
- Identify individuals with divergent ancestry.
- What are the allele frequencies?
- What is the per-SNP call rate?
- In the case of a case-control study (which is the case here), we need to check differential missingness between cases and controls.
Question: Can you think of other scenarios in which you may want to extend the check on differential missingness beyond a check between cases and controls?
Let’s get our hands dirty
All clear? Let’s start the work. On to step 1 of the QC for GWAS: filter samples of poor quality in Chapter 3.
Sample QC
Let’s start with the per-sample quality control. The goal here is to identify all the samples that for some reason (mentioned in the previous section) should be excluded.
Sex
We need to identify of individuals with discordant sex information comparing phenotypic and genotypic data. Let’s calculate the mean homozygosity rate across X-chromosome markers for each individual in the study.
plink --bfile dummy_project/rawdata --check-sex --out dummy_project/rawdata
This produces a file with the following columns:
- FID Family ID
- IID Within-family ID
- PEDSEX Sex code in input file
- SNPSEX Imputed sex code (1 = male, 2 = female, 0 = unknown)
- STATUS ‘OK’ if PEDSEX and SNPSEX match and are nonzero, ‘PROBLEM’ otherwise
- F Inbreeding coefficient, considering only X chromosome. Not present with ‘y-only’.
- YCOUNT Number of nonmissing genotype calls on Y chromosome. Requires ‘ycount’/‘y-only’.
We need to get a list of individuals with discordant sex data.
cat dummy_project/rawdata.sexcheck | awk '$5 =="STATUS" || $5 =="PROBLEM"' > dummy_project/rawdata.sexprobs.txt
Let’s have a look at the results.
cat dummy_project/rawdata.sexprobs.txt
Table 3.1: Sex issues.
FID | IID | PEDSEX | SNPSEX | STATUS | F |
|---|
772 | 772 | 2 | 0 | PROBLEM | 0.3084 |
853 | 853 | 2 | 0 | PROBLEM | 0.3666 |
1,920 | 1,920 | 2 | 0 | PROBLEM | 0.4066 |
When the homozygosity rate (F) is more than 0.2, but less than 0.8, the genotype data are inconclusive regarding the sex of an individual and these are marked in column SNPSEX with a 0, and the column STATUS “PROBLEM”.
Report the IDs of individuals with discordant sex information (Table 3.1) to those who conducted sex phenotyping. In situations in which discrepancy cannot be resolved, add the family ID (FID) and individual ID (IID) of the samples to a file named fail-sexcheck-qc.txt (one individual per line, tab delimited).
grep "PROBLEM" dummy_project/rawdata.sexcheck | awk '{ print $1, $2}' > dummy_project/fail-sexcheck-qc.txt
Sample call rates
Let’s get an overview of the missing data per sample and per SNP.
plink --bfile dummy_project/rawdata --missing --out dummy_project/rawdata
This produces two files, rawdata/rawdata.imiss and rawdata/rawdata.lmiss. In the .imiss-file the N_MISS column denotes the number of missing SNPs, and the F_MISS column denotes the proportion of missing SNPs per individual.
The grey dashed line in Figure 3.1 indicates the mean call rate, while the red dashed line indicates the threshold we had determined above.
Heterozygosity rate
To properly calculate heterozygosity rate and relatedness (identity-by-descent [IBD]) we need to do four things:
- pre-clean the data to get a high-quality set,
- of independent SNPs,
- exclude long-range linkage disequilibrium (LD) blocks that bias with these calculations, and
- exclude A/T and C/G SNPs as these may be ambivalent in interpretation when frequencies between cases and controls are close (MAF ± 0.45),
- remove all non-autosomal SNPs.
You can find an up-to-date list of LD blocks you should exclude in these types of analyses here for the different genome builds. In this case we are using build 37. For the purpose of this book we included a file with these regions in the support-directory.
We will use the following settings:
- remove A/T and C/G SNPs with the flag
--exclude dummy_project/all.atcg.variants.txt,
- call rate <1% with the flag
--geno 0.10,
- Hardy-Weinberg Equilibrium (HWE) p-value > 1x10-3 with the flag
--hwe 1e-3,
- and MAF>10% with the flag
--maf 0.10,
- prune the data to only select independent SNPs (with low LD r^2) of one pair each with
r^2 = 0.2 with the flags --indep-pairwise 100 10 0.2 and --extract rawdata/raw-GWA-data.prune.in,
- SNPs in long-range LD regions (for example: MHC chr 6 25.8-36Mb, chr 8 inversion 6-16Mb, chr17 40-45Mb, and a few more) with the flag
--exclude range support/exclude_problematic_range.txt,
- remove non-autosomal SNPs with the flag
--autosome.
First, get a list of A/T and C/G SNPs. Remember, the list of markers for this GWAS is noted in the .bim file. We can simply grep all the lines where the two alleles either have an A/T or C/G combination.
cat dummy_project/rawdata.bim | \
awk '($5 == "A" && $6 == "T") || ($5 == "T" && $6 == "A") || ($5 == "C" && $6 == "G") || ($5 == "G" && $6 == "C")' | awk '{ print $2, $1, $4, $3, $5, $6 }' \
> dummy_project/all.atcg.variants.txt
Second, clean the data and get a list of independent SNPs.
plink --bfile dummy_project/rawdata \
--autosome \
--maf 0.10 --geno 0.10 --hwe 1e-3 \
--indep-pairwise 100 10 0.2 \
--exclude range support/exclude_problematic_range.txt \
--make-bed --out dummy_project/rawdata.clean.temp
Please note, we have create a dataset without taking into account LD structure. Hence, the message ‘Pruned 0 variants from chromosome 1, leaving 19420.’ etc. In a dataset without any LD structure this flag --indep-pairwise 100 10 0.2 doesn’t actually work. However, with real-data you can use it to prune out unwanted SNPs in high LD.
Third, exclude the pruned SNPs. Note, how we include a file to exclude high-LD for the purpose of the practical.
plink --bfile dummy_project/rawdata.clean.temp \
--extract rawdata/raw-GWA-data.prune.in \
--make-bed --out dummy_project/rawdata.clean.ultraclean.temp
Fourth, remove the A/T and C/G SNPs.
plink --bfile dummy_project/rawdata.clean.ultraclean.temp \
--exclude dummy_project/all.atcg.variants.txt \
--make-bed --out dummy_project/rawdata.clean.ultraclean
Please note, this dataset doesn’t actually include this type of SNP, hence rawdata/all.atcg.variants.txt is empty! Again, you can use this command in real-data to exclude A/T and C/G SNPs.
Lastly, remove the temporary files.
rm -fv dummy_project/*.temp*
Finally, we can calculate the heterozygosity rate.
plink --bfile dummy_project/rawdata.clean.ultraclean --het --out dummy_project/rawdata.clean.ultraclean
This creates the file dummy_project/rawdata.clean.ultraclean.het, in which the third column denotes the observed number of homozygous genotypes, O(Hom), and the fifth column denotes the number of nonmissing genotypes, N(NM), per individual. We can now calculate the observed heterozygosity rate per individual using the formula (N(NM) - O(Hom))/N(NM).
Often there is a correlation between heterozygosity rate and missing data. Thus, we should plot the observed heterozygosity rate per individual on the x-axis and the proportion of missing SNP, that is the ‘SNP call rate’, per individuals on the y-axis (Figure 3.2).
Examine the plot (Figure 3.2) to decide reasonable thresholds at which to exclude individuals based on elevated missing or extreme heterozygosity. We chose to exclude all individuals with a genotype failure rate >= 0.03 (vertical dashed line) and/or a heterozygosity rate ± 3 s.d. from the mean (horizontal dashed lines). Add the FID and IID of the samples failing this QC to the file named fail-imisshet-qc.txt.
How would you create this file?
If all is right, you’d have something like Table 3.2.
Table 3.2: Failed samples due to sample call rates and heterozygosity rate.
FID | IID |
|---|
28 | 28 |
50 | 50 |
52 | 52 |
53 | 53 |
67 | 67 |
96 | 96 |
205 | 205 |
298 | 298 |
310 | 310 |
351 | 351 |
404 | 404 |
418 | 418 |
423 | 423 |
428 | 428 |
442 | 442 |
444 | 444 |
548 | 548 |
772 | 772 |
835 | 835 |
850 | 850 |
853 | 853 |
910 | 910 |
939 | 939 |
966 | 966 |
968 | 968 |
1,003 | 1,003 |
1,006 | 1,006 |
1,045 | 1,045 |
1,058 | 1,058 |
1,154 | 1,154 |
1,236 | 1,236 |
1,294 | 1,294 |
1,395 | 1,395 |
1,537 | 1,537 |
1,554 | 1,554 |
1,587 | 1,587 |
1,694 | 1,694 |
1,789 | 1,789 |
1,832 | 1,832 |
1,866 | 1,866 |
1,904 | 1,904 |
1,920 | 1,920 |
1,951 | 1,951 |
1,952 | 1,952 |
1,953 | 1,953 |
1,954 | 1,954 |
1,955 | 1,955 |
1,981 | 1,981 |
1,984 | 1,984 |
1,985 | 1,985 |
1,986 | 1,986 |
Ancestral background
Using a Principal Component Analysis (PCA) we can reduce the dimensions of the data, and project the “ancestral distances”. In other words, the principal component 1 (the first dimension) and principal component 2 (the second dimension) which will capture most of the variation in the data and represent how much each sample is alike the next. And when compared to a reference, you can deduce the ancestral background of each sample in your dataset. Of course this is relative: we will only know that a given sample is very much a like samples from a given population that exists today.
Nowadays we run such PCA against a large and diverse dataset containing many different populations. Old-school GWAS (pre-2009) would compare a dataset against HapMap 3, nowadays we prefer at a minimum the 1000G phase 3 populations. And in those ancient times the preferred software to run a PCA was Eigensoft which is a bit tricky to install (see Chapter ??), but nowadays PLINK provides the --pca-flag.
For the purpose of this practical primer we will run PCA using PLINK and its --pca-flag against an earlier version of 1000G, phase 1, which is slightly smaller and just as good to use.
1000G phase 1
We will project our data to the reference, in this example 1000G phase 1 (1000G), which includes individuals from 14 distinct global populations across 4 ‘super’-populations (Europeans [EUR], Africans [AFR], East-Asians [EAS], and Latin Americans [AMR]). In the real-world, using phase 1 may be just fine, but if you think your population evolved through extensive migration it’s probably best to use phase 3 data. In other words, the choice of reference is really depending on the dataset.
First, we will merge our data with 1000G. The alleles at each marker must be aligned to the same DNA strand to allow our data to merge correctly. Because not all SNPs are required for this analysis the A->T and C->G SNPs, which are more difficult to align, can be omitted.
Filter the 1000G data
First, we should get a list of relevant variants from our rawdata-dataset. We don’t need the other variants present in the 1000G dataset, right?
cat dummy_project/rawdata.bim | grep "rs" > dummy_project/all.variants.txt
Extract those from the 1000G phase 1 data.
plink --bfile ref_1kg_phase1_all/1kg_phase1_all --extract dummy_project/all.variants.txt --make-bed --out ref_1kg_phase1_all/1kg_phase1_raw
Filter A/T & C/G SNPs
As explained, the A/T and C/G SNPs are problematic, we want to exclude these too. So let’s get a list of A/T and C/G variants from 1000G to exclude - this may take a while.
cat ref_1kg_phase1_all/1kg_phase1_raw.bim | \
awk '($5 == "A" && $6 == "T") || ($5 == "T" && $6 == "A") || ($5 == "C" && $6 == "G") || ($5 == "G" && $6 == "C")' | awk '{ print $2, $1, $4, $3, $5, $6 }' \
> ref_1kg_phase1_all/all.1kg.atcg.variants.txt
Exclude those A/T and C/G variants in both datasets and at the same time filter to only retain high-quality data and exclude non-autosomal variants.
plink --bfile ref_1kg_phase1_all/1kg_phase1_raw --exclude ref_1kg_phase1_all/all.1kg.atcg.variants.txt -make-bed --out ref_1kg_phase1_all/1kg_phase1_raw_no_atcg
plink --bfile dummy_project/rawdata --exclude ref_1kg_phase1_all/all.1kg.atcg.variants.txt --make-bed --out dummy_project/rawdata_1kg_phase1_raw_no_atcg
Merging datasets
Try and merge the data.
plink --bfile dummy_project/rawdata_1kg_phase1_raw_no_atcg --bmerge ref_1kg_phase1_all/1kg_phase1_raw_no_atcg --make-bed --out dummy_project/rawdata.1kg_phase1
There probably is an error …
Error: 72 variants with 3+ alleles present.
* If you believe this is due to strand inconsistency, try --flip with
dummy_project/rawdata.1kg_phase1-merge.missnp.
(Warning: if the subsequent merge seems to work, strand errors involving SNPs
with A/T or C/G alleles probably remain in your data. If LD between nearby
SNPs is high, --flip-scan should detect them.)
* If you are dealing with genuine multiallelic variants, we recommend exporting
that subset of the data to VCF (via e.g. '--recode vcf'), merging with
another tool/script, and then importing the result; PLINK is not yet suited
to handling them.
See https://www.cog-genomics.org/plink/1.9/data#merge3 for more discussion.
So let’s flip some variants.
plink --bfile dummy_project/rawdata --exclude ref_1kg_phase1_all/all.1kg.atcg.variants.txt --flip dummy_project/rawdata.1kg_phase1-merge.missnp --make-bed --out dummy_project/rawdata_1kg_phase1_raw_no_atcg
Let’s try again and merge the data.
plink --bfile dummy_project/rawdata_1kg_phase1_raw_no_atcg --bmerge ref_1kg_phase1_all/1kg_phase1_raw_no_atcg --make-bed --out dummy_project/rawdata.1kg_phase1
There still is an error – there are a few multi-allelic variants present which PLINK can’t handle.
Error: 14 variants with 3+ alleles present.
* If you believe this is due to strand inconsistency, try --flip with
dummy_project/rawdata.1kg_phase1-merge.missnp.
(Warning: if the subsequent merge seems to work, strand errors involving SNPs
with A/T or C/G alleles probably remain in your data. If LD between nearby
SNPs is high, --flip-scan should detect them.)
* If you are dealing with genuine multiallelic variants, we recommend exporting
that subset of the data to VCF (via e.g. '--recode vcf'), merging with
another tool/script, and then importing the result; PLINK is not yet suited
to handling them.
See https://www.cog-genomics.org/plink/1.9/data#merge3 for more discussion.
Let’s just remove these multi-allelic variants.
plink --bfile dummy_project/rawdata_1kg_phase1_raw_no_atcg --exclude dummy_project/rawdata.1kg_phase1-merge.missnp --make-bed --out dummy_project/rawdata_1kg_phase1_raw_no_atcg_bi
After removing those pesky multi-allelic variants, we should be able to merge the data. We should take of the following:
- extract the pruned SNP-set (remember?),
--extract rawdata/raw-GWA-data.prune.i,
- exclude non-autosomal variants,
--autosome,
- and only keeping high-quality data,
--maf 0.10 --geno 0.10 --hwe 1e-3,
plink --bfile dummy_project/rawdata_1kg_phase1_raw_no_atcg_bi \
--bmerge ref_1kg_phase1_all/1kg_phase1_raw_no_atcg \
--autosome \
--maf 0.10 --geno 0.10 --hwe 1e-3 \
--extract rawdata/raw-GWA-data.prune.in \
--make-bed --out dummy_project/rawdata.1kg_phase1.clean
Before we continue it’s best to clean up a bit of the mess.
rm -fv dummy_project/rawdata_1kg_phase1_raw_no_atcg_bi* ref_1kg_phase1_all/1kg_phase1_raw_no_atcg* dummy_project/rawdata.1kg_phase1.pruned* dummy_project/rawdata_1kg_phase1_raw_no_atcg* dummy_project/rawdata.1kg_phase1-merge* dummy_project/rawdata.1kg_phase1.bed dummy_project/rawdata.1kg_phase1.bim dummy_project/rawdata.1kg_phase1.fam dummy_project/rawdata.1kg_phase1.hh dummy_project/rawdata.1kg_phase1.log ref_1kg_phase1_all/1kg_phase1_raw.*
Now we have prepared our dataset with only high-quality SNPs that have few missing data, that are high-frequent, exclude problematic genomic ranges, and merged to the 1000G phase 1 reference dataset. Your output should look something like this:
PLINK v1.90b7.2 64-bit (11 Dec 2023) www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to dummy_project/rawdata.1kg_phase1.clean.log.
Options in effect:
--autosome
--bfile dummy_project/rawdata_1kg_phase1_raw_no_atcg_bi
--bmerge ref_1kg_phase1_all/1kg_phase1_raw_no_atcg
--extract rawdata/raw-GWA-data.prune.in
--geno 0.10
--hwe 1e-3
--maf 0.10
--make-bed
--out dummy_project/rawdata.1kg_phase1.clean
16384 MB RAM detected; reserving 8192 MB for main workspace.
2000 people loaded from dummy_project/rawdata_1kg_phase1_raw_no_atcg_bi.fam.
1092 people to be merged from ref_1kg_phase1_all/1kg_phase1_raw_no_atcg.fam.
Of these, 1092 are new, while 0 are present in the base dataset.
Warning: Multiple positions seen for variant 'rs3934834'.
Warning: Multiple positions seen for variant 'rs3737728'.
Warning: Multiple positions seen for variant 'rs6687776'.
Warning: Multiple chromosomes seen for variant 'rs1050301'.
Warning: Multiple chromosomes seen for variant 'rs4850'.
317476 markers loaded from dummy_project/rawdata_1kg_phase1_raw_no_atcg_bi.bim.
312239 markers to be merged from ref_1kg_phase1_all/1kg_phase1_raw_no_atcg.bim.
Of these, 14 are new, while 312225 are present in the base dataset.
312190 more multiple-position warnings: see log file.
Performing single-pass merge (3092 people, 308317 variants).
Merged fileset written to dummy_project/rawdata.1kg_phase1.clean-merge.bed +
dummy_project/rawdata.1kg_phase1.clean-merge.bim +
dummy_project/rawdata.1kg_phase1.clean-merge.fam .
308317 variants loaded from .bim file.
3092 people (1522 males, 1570 females) loaded from .fam.
3092 phenotype values loaded from .fam.
--extract: 49856 variants remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 3078 founders and 14 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.994867.
299 variants removed due to missing genotype data (--geno).
--hwe: 13825 variants removed due to Hardy-Weinberg exact test.
2617 variants removed due to minor allele threshold(s)
(--maf/--max-maf/--mac/--max-mac).
33115 variants and 3092 people pass filters and QC.
Phenotype data is quantitative.
--make-bed to dummy_project/rawdata.1kg_phase1.clean.bed +
dummy_project/rawdata.1kg_phase1.clean.bim +
dummy_project/rawdata.1kg_phase1.clean.fam ... done.
So in total there are 3,092 individuals, 1,522 males and 1,570 females, and 3,078 founders and 14 non-founders. The total genotyping rate is 99.5% and 33,115 variants are present.
Principal component analysis
Great, we’ve prepared our dummy project data and merged this with 1000G phase 1. Let’s execute the PCA using --pca in PLINK.
plink --bfile dummy_project/rawdata.1kg_phase1.clean --pca --out dummy_project/rawdata.1kg_phase1.clean
Plotting the PCA results
If all is peachy, you just succesfully ran PCA against 1000G phase 1. Using --pca we have calculated principal components (PCs), 20 in total by default, and we can now start plotting them. Let’s create a scatter diagram of the first two principal components, including all individuals in the file rawdata.1kg_phase1.clean.eigenvec (the first and second principal components are columns 3 and 4, respectively). We need to collect some per-sample information to color the points according to sample origin.
First we collect the results from the --pca, the dummy data phenotype information, and the reference population information.
Derive PC1 and PC2 thresholds so that only individuals who match the given ancestral population are included. For populations of European descent, this will be either the CEU or TSI 1000G individuals (Figure 3.3). Here, we chose to exclude all individuals with a first principal component score less than 0.0023.
Write the FID and IID of these individuals to a file called fail-ancestry-QC.txt.
cat dummy_project/rawdata.1kg_phase1.clean.eigenvec | \
awk '$3 < 0.0023' | awk '{ print $1, $2 }' > dummy_project/fail-ancestry-QC.txt
Choosing which thresholds to apply (and thus which individuals to remove) is not a straightforward process. The key is to remove those individuals with greatly divergent ancestry, as these samples introduce the most bias to the study. Identification of more fine-scale ancestry can be conducted by using less divergent reference samples (e.g., within Europe, stratification could be identified using the CEU, TSI (Italian), GBR (British), FIN (Finnish) and IBS (Iberian) samples from the 1,000 Genomes Project (http://www.1000genomes.org/)). Robust identification of fine-scale population structure often requires the construction of many (2–10) principal components.
Removing samples
Finally! We have a list of samples of poor quality or divergent ancestry, and duplicated or related samples. We should remove these. Let’s collect all IDs from our fail-*-files into a single file.
cat dummy_project/fail-* | sort -k1 | uniq > dummy_project/fail-qc-inds.txt
This new file should now contain a list of unique individuals failing the previous QC steps which we want to remove.
plink --bfile dummy_project/rawdata --remove dummy_project/fail-qc-inds.txt --make-bed --out dummy_project/clean_inds_data
Question: How many variants and samples are left? How many cases and how many controls did you loose?
The next step
Now that you filtered samples, we should turn our attention to step 2 of the QC for GWAS: identify SNPs of poor quality in Chapter @ref(gwas_basics_snp_qc).
Per-SNP QC
Now that we removed samples, we can focus on low-quality variants.
SNP call rates
We start by calculating the missing genotype rate for each SNP, in other words the per-SNP call rate.
plink --bfile dummy_project/clean_inds_data --missing --out dummy_project/clean_inds_data
Let’s visualize the results to identify a threshold for extreme genotype failure rate. We chose a callrate threshold of 3%, but it’s arbitrary and depending on the dataset, the data (visualization), and the number of samples (Figure 4.1).
Differential SNP call rates
There could also be differences in genotype call rates between cases and controls. It is very important to check for this because these differences could lead to spurious associations. We can test all markers for differences in call rate between cases and controls, or based on other criteria.
plink --bfile dummy_project/clean_inds_data --test-missing --out dummy_project/clean_inds_data
Let’s collect all the SNPs with a significantly different (P < 0.00001) missing data rate between cases and controls.
cat dummy_project/clean_inds_data.missing | awk '$5 < 0.00001' | awk '{ print $2 }' > dummy_project/fail-diffmiss-qc.txt
Allele frequencies
We should also get an idea on what the allele frequencies are in our dataset. Low frequent SNPs should probably be excluded, as these are uninformative when monomorphic (allele frequency = 0), or they may lead to spurious associations.
plink --bfile dummy_project/clean_inds_data --freq --out dummy_project/clean_inds_data
Let’s also plot these data. You can view the result below, and try it yourself (Figure 4.2).
A note on allele coding
Oh, one more thing about alleles.
PLINK codes alleles as follows:
A1 = minor allele, the least frequent allele
A2 = major allele, the most frequent allele
And when you use PLINK the flag --freq or --maf is always relative to the A1-allele, as is the odds ratio (OR) or effect size (beta).
However, SNPTEST makes use of the so-called OXFORD-format, this codes alleles as follows:
A = the ‘other’ allele
B = the ‘coded’ allele
When you use SNPTEST it will report the allele frequency as CAF, in other words the coded allele frequency, and the effect size (beta) is always relative to the B-allele. This means, CAF could be the MAF, or minor allele frequency, but this is not a given.
In other words, always make sure what the allele-coding of a given program, be it PLINK, SNPTEST, GCTA, et cetera, is! I cannot stress this enough. Ask yourself: ‘what is the allele frequency referring to?’, ‘the effect size is relative to…?’.
Right, let’s continue.
Hardy-Weinberg Equilibrium
Because we are performing a case-control genome-wide association study, we probably expect some differences in Hardy-Weinberg Equilibrium (HWE), but extreme deviations are probably indicative of genotyping errors.
plink --bfile dummy_project/clean_inds_data --hardy --out dummy_project/clean_inds_data
Let’s also plot these data. You can view the result below, and type over the code to do it yourself.
Final SNP QC
We are ready to perform the final QC. After inspecting the graphs we will filter on a MAF < 0.01, call rate < 0.05, and HWE < 0.00001 (Figure 4.3), in addition those SNPs that failed the differential call rate test will be removed.
plink --bfile dummy_project/clean_inds_data --exclude dummy_project/fail-diffmiss-qc.txt --maf 0.01 --geno 0.05 --hwe 0.00001 --make-bed --out dummy_project/cleandata
A Milestone
Congratulations. You reached a very important milestone. Now that you filtered samples and SNPs, we can finally start the association analyses in Chapter @ref(gwas_testing).
Genome-Wide Association Study
Now that you have learned how to perform QC, you can easily run a GWAS and execute some downstream visualization and analyses. Let’s do this with a dummy dataset.
Exploring the data
Even though someone says that the QC was done, it is still wise and good practice to run some of the commands above to get a ‘feeling’ about the data. So let’s do this.
plink --bfile gwas/gwa --freq --out dummy_project/gwa
plink --bfile gwas/gwa --missing --out dummy_project/gwa
plink --bfile gwas/gwa --hardy --out dummy_project/gwa
Let’s visualize the results. First we should load in all the results.
Question: Load the data using R. [Hint: use and adapt the examples from the previous chapters.]
We can plot the per-stratum HWE p-values.
Question: Plot the per-stratum HWE p-values using R. [Hint: use and adapt the examples from the previous chapters.]
We will want to see what the distribution of allele frequencies looks like.
Question: Plot the allele frequencies using R. [Hint: use and adapt the examples from the previous chapters.]
We will want to identify samples that have poor call rates.
Question: Plot the per-sample call rates using R. [Hint: use and adapt the examples from the previous chapters.]
We also need to know what the per SNP call rates are.
Question: Plot the per-SNP call rates using R. [Hint: use and adapt the examples from the previous chapters.]
Genetic models
A simple chi-square test of association can be done.
plink --bfile gwas/gwa --model --out gwas/data
Genotypic, dominant and recessive tests will not be conducted if any one of the cells in the table of case-control by genotype counts contains less than five observations. This is because the chi-square approximation may not be reliable when cell counts are small. For SNPs with MAFs < 5%, a sample of more than 2,000 cases and controls would be required to meet this threshold and more than 50,000 would be required for SNPs with MAF < 1%.
You can change this default behaviour by adding the flag --cell, e.g., we could lower the threshold to 3.
plink --bfile gwas/gwa --model --cell 3 --out gwas/data
Let’s review the contents of the results.
It contains 1,530,510 rows, one for each SNP, and each type of test (genotypic, trend, allelic, dominant, and recessive) and the following columns:
- chromosome [
CHR],
- the SNP identifier [
SNP],
- the minor allele [
A1] (remember, PLINK always codes the A1-allele as the minor allele!),
- the major allele [
A2],
- the test performed [
TEST]:
GENO (genotypic association);
TREND (Cochran-Armitage trend);
ALLELIC (allelic as- sociation);
DOM (dominant model); and
REC (recessive model)],
- the cell frequency counts for cases [
AFF],
- the cell frequency counts for controls [
UNAFF],
- the chi-square test statistic [
CHISQ],
- the degrees of freedom for the test [
DF],
- and the asymptotic P value [
P] of association.
Question: Do you know which model, i.e. TEST is most commonly used and reported? And why is that, do think?
Logistic regression
We can also perform a test of association using logistic regression. In this case we might want to correct for covariates/confounding factors, for example age, sex, ancestral background, i.e. principal components, and other study specific covariates (e.g. hospital of inclusion, genotyping centre etc.). In that case each of these P values is adjusted for the effect of the covariates.
When running a regression analysis, be it linear or logistic, PLINK assumes a multiplicative model. By default, when at least one male and one female is present, sex (male = 1, female = 0) is automatically added as a covariate on X chromosome SNPs, and nowhere else. The sex flag causes it to be added everywhere, while no-x-sex excludes it.
plink --bfile gwas/gwa --logistic sex --covar gwas/gwa.covar --out gwas/data
Let’s examine the results.
## [1] 918306 9
## CHR SNP BP A1 TEST NMISS OR STAT P
## <int> <char> <int> <char> <char> <int> <num> <num> <num>
## 1: 1 rs3934834 995669 T ADD 3818 1.0290 0.38120 0.7031
## 2: 1 rs3934834 995669 T AGE 3818 1.0020 1.11800 0.2635
## 3: 1 rs3934834 995669 T SEX 3818 1.0120 0.19090 0.8486
## 4: 1 rs3737728 1011278 A ADD 3982 1.0190 0.38670 0.6990
## 5: 1 rs3737728 1011278 A AGE 3982 1.0020 1.09800 0.2721
## 6: 1 rs3737728 1011278 A SEX 3982 1.0060 0.09898 0.9212
## 7: 1 rs6687776 1020428 T ADD 3915 0.9692 -0.33330 0.7389
## 8: 1 rs6687776 1020428 T AGE 3915 1.0020 1.04000 0.2984
## 9: 1 rs6687776 1020428 T SEX 3915 1.0150 0.23690 0.8127
Question: How come there are more lines in this file than there are variants?
If no model option is specified, the first row for each SNP corresponds to results for a multiplicative test of association. The C >= 0 subsequent rows for each SNP correspond to separate tests of significance for each of the C covariates included in the regression model. We can remove the covariate-specific lines from the main report by adding the hide-covar flag.
The columns in the association results are:
- the chromosome [
CHR],
- the SNP identifier [
SNP],
- the base-pair location [
BP],
- the minor allele [
A1],
- the test performed [
TEST]: ADD (multiplicative model or genotypic model testing additivity),
GENO_2DF (genotypic model),
DOMDEV (genotypic model testing deviation from additivity),
DOM (dominant model), or
REC (recessive model)],
- the number of missing individuals included [
NMISS],
- the
OR relative to the A1, i.e. minor allele,
- the coefficient z-statistic [
STAT], and
- the asymptotic P-value [
P] of association.
We need to calculate the standard error and confidence interval from the z-statistic. We can modify the effect size (OR) to output the beta by adding the beta flag.
Question: Can you write down the mathematical relation between beta and OR?
Let’s get visual
Looking at numbers is important, but it won’t give you a perfect overview. We should turn to visualizing our results in Chapter 6.
GWAS visualisation
Data visualization is key, not only for presentation but also to inspect the results.
QQ plots
We should create quantile-quantile (QQ) plots to compare the observed association test statistics with their expected values under the null hypothesis of no association and so assess the number, magnitude and quality of true associations.
First, we will add the standard error, call rate, A2, and allele frequencies.
## Key: <SNP>
## SNP CHR BP A1 NMISS OR STAT P A2 MAF
## <char> <int> <int> <char> <int> <num> <num> <num> <char> <num>
## 1: rs10000010 4 21227772 C 3996 1.0420 0.9010 0.36760 T 0.4258
## 2: rs10000023 4 95952929 T 3957 0.9902 -0.2160 0.82900 G 0.4841
## 3: rs10000030 4 103593179 A 3991 0.9779 -0.3696 0.71170 G 0.1616
## 4: rs1000007 2 237416793 C 4000 1.0180 0.3649 0.71520 T 0.3122
## 5: rs10000092 4 21504615 C 3963 0.9240 -1.6770 0.09354 T 0.3430
## 6: rs10000121 4 157793485 G 3919 0.9665 -0.7525 0.45170 A 0.4532
## NCHROBS callrate
## <int> <num>
## 1: 7992 0.99900
## 2: 7914 0.98925
## 3: 7982 0.99775
## 4: 8000 1.00000
## 5: 7926 0.99075
## 6: 7838 0.97975
## [1] 306102 14
## # A tibble: 6 × 14
## SNP CHR BP A1 A2 MAF callrate NMISS NCHROBS BETA SE
## <chr> <int> <int> <chr> <chr> <dbl> <dbl> <int> <int> <dbl> <dbl>
## 1 rs10000… 4 2.12e7 C T 0.426 0.999 3996 7992 0.0411 0.0457
## 2 rs10000… 4 9.60e7 T G 0.484 0.989 3957 7914 -0.00985 0.0456
## 3 rs10000… 4 1.04e8 A G 0.162 0.998 3991 7982 -0.0223 0.0605
## 4 rs10000… 2 2.37e8 C T 0.312 1 4000 8000 0.0178 0.0489
## 5 rs10000… 4 2.15e7 C T 0.343 0.991 3963 7926 -0.0790 0.0471
## 6 rs10000… 4 1.58e8 G A 0.453 0.980 3919 7838 -0.0341 0.0453
## # ℹ 3 more variables: OR <dbl>, STAT <dbl>, P <dbl>
Let’s list the number of SNPs per chromosome. This gives a pretty good idea about the per-chromosome coverage. And it’s a sanity check: did the whole analysis run properly (we expect 22 chromosomes)?
Question: Why do the number of variants per chrosome (approximately) correlate with the chromosome number?
Question: Where are the data for chromosome X, Y and MT?
Let’s plot the QQ plot to diagnose our GWAS.
Manhattan plots
We also need to create a Manhattan plot to display the association test P-values as a function of chromosomal location and thus provide a visual summary of association test results that draw immediate attention to any regions of significance (Figure 6.2).
Other plots
It is also informative to plot the density per chromosome. We can use the CMplot for that which you can find here. For now we just make these graphs ‘quick-n-dirty’, you can further prettify them, but you easily loose track of time, so maybe carry on.
Question: What do the grey spots on the density plot indicate?
This would lead to the following graphs.
Interactive plots
You can also make an interactive version of the Manhattan - just because you can. The code below shows you how.
library(plotly)
library(dplyr)
# Prepare the dataset (as an example we use the data (gwasResults) from the `qqman`-package)
don <- gwasResults %>%
# Compute chromosome size
group_by(CHR) %>%
summarise(chr_len=max(BP)) %>%
# Calculate cumulative position of each chromosome
mutate(tot=cumsum(chr_len)-chr_len) %>%
select(-chr_len) %>%
# Add this info to the initial dataset
left_join(gwasResults, ., by=c("CHR"="CHR")) %>%
# Add a cumulative position of each SNP
arrange(CHR, BP) %>%
mutate( BPcum=BP+tot) %>%
# Add highlight and annotation information
mutate( is_highlight=ifelse(SNP %in% snpsOfInterest, "yes", "no")) %>%
# Filter SNP to make the plot lighter
filter(-log10(P)>0.5)
# Prepare X axis
axisdf <- don %>% group_by(CHR) %>% summarize(center=( max(BPcum) + min(BPcum) ) / 2 )
# Prepare text description for each SNP:
don$text <- paste("SNP: ", don$SNP, "\nPosition: ", don$BP, "\nChromosome: ", don$CHR, "\nLOD score:", -log10(don$P) %>% round(2), "\nWhat else do you wanna know", sep="")
# Make the plot
p <- ggplot(don, aes(x=BPcum, y=-log10(P), text=text)) +
# Show all points
geom_point( aes(color=as.factor(CHR)), alpha=0.8, size=1.3) +
scale_color_manual(values = rep(c("grey", "skyblue"), 22 )) +
# custom X axis:
scale_x_continuous( label = axisdf$CHR, breaks= axisdf$center ) +
scale_y_continuous(expand = c(0, 0)) + # remove space between plot area and x axis
# Add highlighted points
geom_point(data=subset(don, is_highlight=="yes"), color="orange", size=2) +
# Custom the theme:
theme_bw() +
theme(
legend.position="none",
panel.border = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()
)
ggplotly(p, tooltip="text")
It will produce something like this.

Again, this is an example with dummy data - you can try to do it for our GWAS, but careful with the time. You can also choose to carry on.
You will encounter the above types of visualizations in any high-quality GWAS paper, because each is so critically informative. Usually, analysts of large-scale meta-analyses of GWAS will also stratify the QQ-plots based on the imputation quality (if your GWAS was imputed), call rate, and allele frequency (although that is rarely shared in publications, not even in supplemental material).
Stop playing around
Alright. It’s time to stop playing around and do a quick recap of what you’ve learned.
- You learned how to convert datasets.
- You learned how to execute sample QC and create diagnostic graphics
- You learned how to do the same for SNP QC
- You learned how to execute an association study given a dataset, covariates, and different assumptions regarding the genetic model.
- You learned how to visualize results and played around with different visuals.
You should be ready for the real stuff. And if not, the next chapter will help you get ready: Chapter @ref(wtccc1_intro).
Licenses and disclaimers
Copyright
This book and all its material (“content”) is protected by copyright under Dutch Copyright laws and is the property of the author or the party credited as the provider of the content. You may not copy, reproduce, distribute, publish, display, perform, modify, create derivative works, transmit, or in any way exploit any such content, nor may you distribute any part of this content over any network, including a local area network, sell or offer it for sale, or use such content to construct any kind of database. You may not alter or remove any copyright or other notice from copies of the content on this website. Copying or storing any content except as provided above is expressly prohibited without prior written permission of the author or the copyright holder identified in the individual content’s copyright notice. For permission to use this content, please contact the author.
Disclaimer
The content contained herein is provided only for educational and informational purposes or as required by Dutch law. The author attempted to ensure that content is accurate and obtained from reliable sources, but does not represent it to be error-free. The author may add, amend or repeal any text, procedure or regulation, and failure to timely post such changes to this book shall not be construed as a waiver of enforcement. The author does not warrant that any functions on this website or the contents and references herein will be uninterrupted, that defects will be corrected, or that this website or the contents and references will be free from viruses or other harmful components. Any links to third party information on the author’s website are provided as a courtesy and do not constitute an endorsement of those materials or the third party providing them.
Images and data used
I took the at-most care to use refer to the original works and data sources where needed. Likewise, all the images c.q. figures are either produced specifically for this book, or I took them from Unsplash to brighten up the book. If you feel I made a mistake and your work should be properly referenced, please don’t hesitate to contact me.
These are the images from Unsplash listed here in no particular order.
--- 
title: "A Practical Primer in Human Complex Genetics"
subtitle: "with a use-case in cardiovascular disease"
author: "[dr. Sander W. van der Laan](https://vanderlaanand.science) [![](./img/_logo/twitter_circle_blue.png){width=2%}](https://www.twitter.com/swvanderlaan) [![](./img/_logo/email_circle_blue.png){width=2%}](mailto:s.w.vanderlaan@gmail.com)"
date: "Version 2.0.2 (2024-04-04)"
description: "This is a practical primer in human complex genetics with a use-case in cardiovascular disease. The output format for this primer is bookdown::gitbook."
documentclass: book
github-repo: swvanderlaan/A_Practical_Primer_in_Human_Complex_Genetics
link-citations: yes
bibliography:
- bibliography/book.bib
- bibliography/packages.bib
biblio-style: apalike
site: bookdown::bookdown_site
always_allow_html: yes
#cover-image: "images/cover.png"
#apple-touch-icon: "touch-icon.png"
#apple-touch-icon-size: 120
#favicon: "favicon.ico"
---

# About this primer

Ever since the first genome-wide association study (GWAS) on [age-related macular degeneration](https://doi.org/10.1126/science.1109557){target="_blank"}, and the promise of personalized medicine in the wake of the Human Genome Project, large-scale genetic association studies hold significant sway in contemporary health research and [drive drug-development pipelines](http://dx.doi.org/10.1038/nrd.2017.262){target="_blank"}. In the past 2 decades, researchers delved into GWAS, aiming to unveil genetic variations linked to both human traits, such as the color of your eyes, and rare and common complex diseases. These findings serve as crucial keys to unravel the intricate mechanisms underlying diseases, shedding light on whether the correlations identified in observational studies between risk factors and diseases are truly causal. 

These studies have ushered in an exciting era where many researchers thrive on developing new methods and bioinformatic tools to parse ever-growing large datasets collected large population-based biobanks. However, the analyses of these data are challenging and it can be daunting to see the forest for tree among the many tools and their various functions. Enter _A Practical Primer in Human Complex Genetics_. This [GitBook](https://cjvanlissa.github.io/gitbook-demo/){target="_blank"} was originally written back in 2022 for the **Genetic Epidemiology** course organized by the [Master Epidemiology](https://epidemiology-education.nl){target="_blank"} of Utrecht University. This practical guide will teach you how to design a GWAS, perform quality control (QC), execute the actual analyses, annotate the GWAS results, and perform further downstream post-GWAS analyses. Throughout the book you'll work with 'dummy', that is fake, data, but in the end, we will use real-world data from the first release of the [*Welcome Trust Case-Control Consortium (WTCCC)*](https://www.wtccc.org.uk/ccc1/overview.html){target="_blank"} focusing on coronary artery disease (CAD). 

A major component of modern-day GWAS is [genetic imputation](https://www.nature.com/articles/nrg2796){target="_blank"}, but for practical reasons it is not part of this book. However, I will provide some pointers as to how to go about do this with minimal coding or scripting experience. Likewise, the courses does not cover the aspects of meta-analyses of GWAS, but some excellent resources exist to which I will direct. As this practical primer evolves, these and other topics may find their place in this book. 
I should also point out that emphasis of this book is on it being a _practical primer_. It is intended to provide some practical guidance to doing GWAS, and while theory is important, I will not cover this. Again, some very useful and excellent work exists to which I will point you, but I really want you to learn - and understand the theory - by _doing_. 

So, although originally crafted as a companion for the course, this practical guide stands on its own as a comprehensive resource for diving into all facets of doing a GWAS — save for experimental follow-up, of course 😉.

I can imagine this seems overwhelming, but trust me, you'll be okay. Just follow this practical. You'll learn by doing and at the end of the day, you can execute a GWAS independently.

**Ready to start?**

<!-- Your first point of action is to prepare your system for this course in Chapter \@ref(somebackgroundreading). -->

<script>
title=document.getElementById('header');
title.innerHTML = '<img src="img/_headers/banner_man_standing_dna.png" alt="A Practical Primer in Human Complex Genetics">' + title.innerHTML
</script>

<!--chapter:end:index.Rmd-->

# Steps in a Genome-Wide Association Study {#gwas-basics}
<!-- ![](./img/_headers/woman_working_on_code.png){width=100%} -->






Now that you understand a bit of the navigation in Unix-systems, you're ready to start this practical primer. We will make use of a dummy dataset containing cases and controls. We will explain and execute the following steps:

1. convert raw data to a more memory-efficient format
2. apply extensive quality control on samples and SNPs
3. assess the ancestral background of your study population
4. perform association testing
5. visualize association results


## Converting datasets

The format in which genotype data are returned to investigators varies among genome-wide SNP platforms and genotyping centers. Usually genotypes have been called by a genotyping center and returned in the standard `PED` and `MAP` file formats designed for `PLINK`.

A `PED` file is a white space (space or tab)-delimited file in which each line represents one individual and the first six columns are mandatory and in the following order:

- 'Family ID', 
- 'Individual ID', 
- 'Paternal ID', 
- 'Maternal ID', 
- 'Sex (1=male, 2=female, 0=missing)', and 
- 'Phenotype (1=unaffected, 2=affected, 0=missing)'. 

The subsequent columns denote genotypes that can be any character (e.g., 1, 2, 3, 4 or A, C, G, T). Zero denotes a missing genotype. Each SNP must have two alleles (i.e., both alleles are either present or absent). 
The order of SNPs in the PED file is given in the MAP file, in which each line denotes a single marker and the four white-space–separated columns are chromosome (1–22, X, Y or 0 for unplaced), marker name (typically an rs number), genetic distance in Morgans (this can be fixed to 0) and base-pair position (bp units).

Let's start by using `PLINK` to converting the datasets to a lighter, binary form (a `.bed`-file). This file saves data in a more memory- and time-efficient manner (in a 'binary'-format) to facilitate the analysis of large-scale data sets [@purcell2007]. The marker-information is stored in the `.bim`-file and the family information in the `.fam`-file. `PLINK` creates a `.log` file (named `raw-GWA-data.log`) that details (among other information) the implemented commands, the number of cases and controls in the input files, any excluded data and the genotyping rate in the remaining data. This file is very useful for checking whether the software is successfully completing commands.

Make sure you are in the right directory. Do you remember how to get there?

```
cd ~/Desktop/practical
```

Next, we'll make a project directory.

```
mkdir -v ~/Desktop/practical/dummy_project
```

Now, we'll convert the `.ped`/`.map` files to the binary-format.

```
plink --file rawdata/raw-GWA-data --make-bed --out dummy_project/rawdata
```

Let's review the `.log`-file for a bit. It should look something like this:

```
PLINK v1.90b7.2 64-bit (11 Dec 2023)           www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to dummy_project/rawdata.log.
Options in effect:
  --file rawdata/raw-GWA-data
  --make-bed
  --out dummy_project/rawdata

16384 MB RAM detected; reserving 8192 MB for main workspace.
.ped scan complete (for binary autoconversion).
Performing single-pass .bed write (317503 variants, 2000 people).
--file: dummy_project/rawdata-temporary.bed +
dummy_project/rawdata-temporary.bim + dummy_project/rawdata-temporary.fam
written.
317503 variants loaded from .bim file.
2000 people (997 males, 1003 females) loaded from .fam.
2000 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 2000 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Warning: 11440 het. haploid genotypes present (see dummy_project/rawdata.hh );
many commands treat these as missing.
Total genotyping rate is 0.985682.
317503 variants and 2000 people pass filters and QC.
Among remaining phenotypes, 1023 are cases and 977 are controls.
--make-bed to dummy_project/rawdata.bed + dummy_project/rawdata.bim +
dummy_project/rawdata.fam ... done.
```

So, there are 317,503 variants included for 2,000 people, 997 males and 1,003 females. All of these individuals are ['founders'](https://www.cog-genomics.org/plink/1.9/filter#nonfounders){target="_blank"}. There are 1,023 cases and 977 controls. The genotyping rate is about 98.6% which is pretty good. Lastly, there are 11,440 heterozygous haploid genotypes present.

> Question: Can you think off what the '11,440 heterozygous haploid genotypes present' represent? 


## Quality control

We are ready for some quality control and quality assurance, heavily inspired by Anderson _et al._ [@anderson2010] and Laurie _et al._ [@laurie2010]. In general, we should check out a couple of things regarding the data quality on two levels:

1) samples
2) variants

So, we will investigate the following:

- Are the *sexes* based on genetic data matching the ones given by the phenotype file?
- Identify individuals that are outliers in terms of missing data (_call rate_) or heterozygosity rates. This could indicate a genotyping error or sample swap.
- Identify duplicated or related individuals.
- Identify individuals with divergent ancestry.
- What are the allele frequencies?
- What is the per-SNP call rate?
- In the case of a case-control study (which is the case here), we need to check differential missingness between cases and controls. 

> Question: Can you think of other scenarios in which you may want to extend the check on differential missingness beyond a check between cases and controls?


## Let's get our hands dirty

All clear? Let's start the work. On to step 1 of the QC for GWAS: filter samples of poor quality in Chapter \@ref(gwas-basics-sample-qc).

<!-- ```{js, echo = FALSE} -->
<!-- title=document.getElementById('header'); -->
<!-- title.innerHTML = '<img src="img/_headers/woman_working_on_code.png" alt="GWAS basics">' + title.innerHTML -->
<!-- ``` -->

<!--chapter:end:03_1_gwas_basics.Rmd-->

# Sample QC {#gwas-basics-sample-qc}
<!-- ![](./img/_gwas_dummy/gwas_sample_qc.png){width=70%} -->






Let's start with the per-sample quality control. The goal here is to identify all the samples that for some reason (mentioned in the previous section) should be excluded. 

## Sex
We need to identify of individuals with discordant sex information comparing phenotypic and genotypic data. Let's calculate the mean homozygosity rate across X-chromosome markers for each individual in the study.

```
plink --bfile dummy_project/rawdata --check-sex --out dummy_project/rawdata
```

This produces a file with the following columns:

- _FID_	Family ID
- _IID_	Within-family ID
- _PEDSEX_	Sex code in input file
- _SNPSEX_	Imputed sex code (1 = male, 2 = female, 0 = unknown)
- _STATUS_	'OK' if PEDSEX and SNPSEX match and are nonzero, 'PROBLEM' otherwise
- _F_	Inbreeding coefficient, considering only X chromosome. Not present with 'y-only'.
- _YCOUNT_	Number of nonmissing genotype calls on Y chromosome. Requires 'ycount'/'y-only'.


We need to get a list of individuals with discordant sex data.

```
cat dummy_project/rawdata.sexcheck | awk '$5 =="STATUS" || $5 =="PROBLEM"'  > dummy_project/rawdata.sexprobs.txt
```

Let's have a look at the results.

```
cat dummy_project/rawdata.sexprobs.txt
```






```{=html}
<div class="tabwid"><style>.cl-e0d1ef08{}.cl-e0cc032c{font-family:'Helvetica';font-size:11pt;font-weight:normal;font-style:normal;text-decoration:none;color:rgba(0, 0, 0, 1.00);background-color:transparent;}.cl-e0ce276a{margin:0;text-align:right;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);padding-bottom:5pt;padding-top:5pt;padding-left:5pt;padding-right:5pt;line-height: 1;background-color:transparent;}.cl-e0ce2774{margin:0;text-align:left;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);padding-bottom:5pt;padding-top:5pt;padding-left:5pt;padding-right:5pt;line-height: 1;background-color:transparent;}.cl-e0ce38c2{width:0.568in;background-color:transparent;vertical-align: middle;border-bottom: 1.5pt solid rgba(102, 102, 102, 1.00);border-top: 1.5pt solid rgba(102, 102, 102, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e0ce38c3{width:0.805in;background-color:transparent;vertical-align: middle;border-bottom: 1.5pt solid rgba(102, 102, 102, 1.00);border-top: 1.5pt solid rgba(102, 102, 102, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e0ce38c4{width:0.932in;background-color:transparent;vertical-align: middle;border-bottom: 1.5pt solid rgba(102, 102, 102, 1.00);border-top: 1.5pt solid rgba(102, 102, 102, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e0ce38cc{width:0.652in;background-color:transparent;vertical-align: middle;border-bottom: 1.5pt solid rgba(102, 102, 102, 1.00);border-top: 1.5pt solid rgba(102, 102, 102, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e0ce38cd{width:0.568in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e0ce38ce{width:0.805in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e0ce38d6{width:0.932in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e0ce38d7{width:0.652in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e0ce38d8{width:0.568in;background-color:transparent;vertical-align: middle;border-bottom: 1.5pt solid rgba(102, 102, 102, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e0ce38e0{width:0.805in;background-color:transparent;vertical-align: middle;border-bottom: 1.5pt solid rgba(102, 102, 102, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e0ce38e1{width:0.932in;background-color:transparent;vertical-align: middle;border-bottom: 1.5pt solid rgba(102, 102, 102, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e0ce38e2{width:0.652in;background-color:transparent;vertical-align: middle;border-bottom: 1.5pt solid rgba(102, 102, 102, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}</style><table data-quarto-disable-processing='true' class='cl-e0d1ef08'>

```

<caption style="display:table-caption;">(\#tab:sexissues)<span>Sex issues.</span></caption>

```{=html}

<thead><tr style="overflow-wrap:break-word;"><th class="cl-e0ce38c2"><p class="cl-e0ce276a"><span class="cl-e0cc032c">FID</span></p></th><th class="cl-e0ce38c2"><p class="cl-e0ce276a"><span class="cl-e0cc032c">IID</span></p></th><th class="cl-e0ce38c3"><p class="cl-e0ce276a"><span class="cl-e0cc032c">PEDSEX</span></p></th><th class="cl-e0ce38c3"><p class="cl-e0ce276a"><span class="cl-e0cc032c">SNPSEX</span></p></th><th class="cl-e0ce38c4"><p class="cl-e0ce2774"><span class="cl-e0cc032c">STATUS</span></p></th><th class="cl-e0ce38cc"><p class="cl-e0ce276a"><span class="cl-e0cc032c">F</span></p></th></tr></thead><tbody><tr style="overflow-wrap:break-word;"><td class="cl-e0ce38cd"><p class="cl-e0ce276a"><span class="cl-e0cc032c">772</span></p></td><td class="cl-e0ce38cd"><p class="cl-e0ce276a"><span class="cl-e0cc032c">772</span></p></td><td class="cl-e0ce38ce"><p class="cl-e0ce276a"><span class="cl-e0cc032c">2</span></p></td><td class="cl-e0ce38ce"><p class="cl-e0ce276a"><span class="cl-e0cc032c">0</span></p></td><td class="cl-e0ce38d6"><p class="cl-e0ce2774"><span class="cl-e0cc032c">PROBLEM</span></p></td><td class="cl-e0ce38d7"><p class="cl-e0ce276a"><span class="cl-e0cc032c">0.3084</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0ce38cd"><p class="cl-e0ce276a"><span class="cl-e0cc032c">853</span></p></td><td class="cl-e0ce38cd"><p class="cl-e0ce276a"><span class="cl-e0cc032c">853</span></p></td><td class="cl-e0ce38ce"><p class="cl-e0ce276a"><span class="cl-e0cc032c">2</span></p></td><td class="cl-e0ce38ce"><p class="cl-e0ce276a"><span class="cl-e0cc032c">0</span></p></td><td class="cl-e0ce38d6"><p class="cl-e0ce2774"><span class="cl-e0cc032c">PROBLEM</span></p></td><td class="cl-e0ce38d7"><p class="cl-e0ce276a"><span class="cl-e0cc032c">0.3666</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0ce38d8"><p class="cl-e0ce276a"><span class="cl-e0cc032c">1,920</span></p></td><td class="cl-e0ce38d8"><p class="cl-e0ce276a"><span class="cl-e0cc032c">1,920</span></p></td><td class="cl-e0ce38e0"><p class="cl-e0ce276a"><span class="cl-e0cc032c">2</span></p></td><td class="cl-e0ce38e0"><p class="cl-e0ce276a"><span class="cl-e0cc032c">0</span></p></td><td class="cl-e0ce38e1"><p class="cl-e0ce2774"><span class="cl-e0cc032c">PROBLEM</span></p></td><td class="cl-e0ce38e2"><p class="cl-e0ce276a"><span class="cl-e0cc032c">0.4066</span></p></td></tr></tbody></table></div>
```


When the homozygosity rate (_F_) is more than 0.2, but less than 0.8, the genotype data are inconclusive regarding the sex of an individual and these are marked in column _SNPSEX_ with a 0, and the column _STATUS_ "PROBLEM".

Report the IDs of individuals with discordant sex information (Table \@ref(tab:sexissues)) to those who conducted sex phenotyping. In situations in which discrepancy cannot be resolved, add the family ID (FID) and individual ID (IID) of the samples to a file named `fail-sexcheck-qc.txt` (one individual per line, tab delimited).

```
grep "PROBLEM" dummy_project/rawdata.sexcheck | awk '{ print $1, $2}'  > dummy_project/fail-sexcheck-qc.txt
```

## Sample call rates
Let's get an overview of the missing data per sample and per SNP.

```
plink --bfile dummy_project/rawdata --missing --out dummy_project/rawdata
```

This produces two files, `rawdata/rawdata.imiss` and `rawdata/rawdata.lmiss`. In the `.imiss`-file the _N_MISS_ column denotes the number of missing SNPs, and the _F_MISS_ column denotes the proportion of missing SNPs per individual.






The grey dashed line in Figure \@ref(fig:showsamplecallrate) indicates the mean call rate, while the red dashed line indicates the threshold we had determined above.

<div class="figure" style="text-align: center">
<img src="img/_gwas_dummy/gwas-qc-sample-callrate.png" alt="Per sample call rate." width="85%" />
<p class="caption">(\#fig:showsamplecallrate)Per sample call rate.</p>
</div>


## Heterozygosity rate

To properly calculate heterozygosity rate and relatedness (identity-by-descent [IBD]) we need to do four things:

1) pre-clean the data to get a high-quality set,
2) of independent SNPs,
3) exclude long-range linkage disequilibrium (LD) blocks that bias with these calculations, and
4) exclude A/T and C/G SNPs as these may be ambivalent in interpretation when frequencies between cases and controls are close (MAF ± 0.45),
5) remove all non-autosomal SNPs.

You can find an up-to-date list of LD blocks you should exclude in these types of analyses [here](https://genome.sph.umich.edu/wiki/Regions_of_high_linkage_disequilibrium_(LD)){target="_blank"} for the different genome builds. In this case we are using build 37. For the purpose of this book we included a file with these regions in the `support`-directory.

We will use the following settings:

- remove A/T and C/G SNPs with the flag `--exclude dummy_project/all.atcg.variants.txt`,
- call rate <1% with the flag `--geno 0.10`,
- Hardy-Weinberg Equilibrium (HWE) p-value > 1x10-3 with the flag `--hwe 1e-3`,
- and MAF>10% with the flag `--maf 0.10 `,
- prune the data to only select independent SNPs (with low LD r^2) of one pair each with `r^2 = 0.2` with the flags `--indep-pairwise 100 10 0.2` and `--extract rawdata/raw-GWA-data.prune.in`,
- SNPs in long-range LD regions (for example: MHC chr 6 25.8-36Mb, chr 8 inversion 6-16Mb, chr17 40-45Mb, and a few more) with the flag `--exclude range support/exclude_problematic_range.txt`,
- remove non-autosomal SNPs with the flag `--autosome`.

First, get a list of A/T and C/G SNPs. Remember, the list of markers for this GWAS is noted in the `.bim` file. We can simply grep all the lines where the two alleles either have an A/T or C/G combination.

```
cat dummy_project/rawdata.bim | \
awk '($5 == "A" && $6 == "T") || ($5 == "T" && $6 == "A") || ($5 == "C" && $6 == "G") || ($5 == "G" && $6 == "C")' | awk '{ print $2, $1, $4, $3, $5, $6 }' \
> dummy_project/all.atcg.variants.txt
```

Second, clean the data and get a list of independent SNPs.

<!-- *NEXT YEAR: further explain the exact commands  and how they relate with the above. Is this needed? * -->

```
plink --bfile dummy_project/rawdata \
--autosome \
--maf 0.10 --geno 0.10 --hwe 1e-3 \
--indep-pairwise 100 10 0.2 \
--exclude range support/exclude_problematic_range.txt \
--make-bed --out dummy_project/rawdata.clean.temp
```

> Please note, we have create a dataset without taking into account LD structure. Hence, the message 'Pruned 0 variants from chromosome 1, leaving 19420.' etc. In a dataset without any LD structure this flag `--indep-pairwise 100 10 0.2` doesn't actually work. However, with real-data you can use it to prune out unwanted SNPs in high LD.

Third, exclude the pruned SNPs. Note, how we include a file to exclude high-LD for the purpose of the practical.

```
plink --bfile dummy_project/rawdata.clean.temp \
--extract rawdata/raw-GWA-data.prune.in \
--make-bed --out dummy_project/rawdata.clean.ultraclean.temp
```

Fourth, remove the A/T and C/G SNPs.

```
plink --bfile dummy_project/rawdata.clean.ultraclean.temp \
--exclude dummy_project/all.atcg.variants.txt \
--make-bed --out dummy_project/rawdata.clean.ultraclean
```

> Please note, this dataset doesn't actually include this type of SNP, hence `rawdata/all.atcg.variants.txt` is empty! Again, you can use this command in real-data to exclude A/T and C/G SNPs.

Lastly, remove the temporary files.

```
rm -fv dummy_project/*.temp*
```

Finally, we can calculate the heterozygosity rate.

```
plink --bfile dummy_project/rawdata.clean.ultraclean --het --out dummy_project/rawdata.clean.ultraclean
```

This creates the file `dummy_project/rawdata.clean.ultraclean.het`, in which the third column denotes the observed number of homozygous genotypes, O(Hom), and the fifth column denotes the number of nonmissing genotypes, N(NM), per individual. We can now calculate the observed heterozygosity rate per individual using the formula (N(NM) - O(Hom))/N(NM).

Often there is a correlation between heterozygosity rate and missing data. Thus, we should plot the observed heterozygosity rate per individual on the x-axis and the proportion of missing SNP, that is the 'SNP call rate', per individuals on the y-axis (Figure \@ref(fig:showheterozygosity)).






<!-- I prefer to use ggpubr, but I don't understand why this code doesn't work anymore...?!?! -->
<!-- ```{r heterozygosity, eval = FALSE} -->
<!-- ggpubr::ggscatter(raw_IMISSHET, x = "logF_MISS", y = "meanHet", -->
<!--                   colors = colors, -->
<!--                   xlab = "Proportion of missing genotypes", ylab = "Heterozygosity rate") + -->
<!--   scale_x_continuous(labels=c("-3" = "0.001", "-2" = "0.01", -->
<!--                               "-1" = "0.1", "0" = "1")) + -->
<!--   geom_hline(yintercept = lower_meanHet, linetype = "dashed", -->
<!--                 color = "#E55738", size = 1) + -->
<!--   geom_hline(yintercept = upper_meanHet, linetype = "dashed", -->
<!--                 color = "#E55738", size = 1) + -->
<!--   geom_vline(xintercept = prop_miss, linetype = "dashed", -->
<!--                 color = "#E55738", size = 1) -->
<!-- ``` -->

<div class="figure" style="text-align: center">
<img src="img/_gwas_dummy/show-heterozygosity.png" alt="Heterozygosity as a function of SNP call rate." width="85%" />
<p class="caption">(\#fig:showheterozygosity)Heterozygosity as a function of SNP call rate.</p>
</div>

Examine the plot (Figure \@ref(fig:showheterozygosity)) to decide reasonable thresholds at which to exclude individuals based on elevated missing or extreme heterozygosity. We chose to exclude all individuals with a genotype failure rate >= 0.03 (vertical dashed line) and/or a heterozygosity rate ± 3 s.d. from the mean (horizontal dashed lines). Add the FID and IID of the samples failing this QC to the file named `fail-imisshet-qc.txt`.

> How would you create this file?



If all is right, you'd have something like Table \@ref(tab:failedcallratehet).




```{=html}
<div class="tabwid"><style>.cl-e0eeea0e{}.cl-e0e770d0{font-family:'Helvetica';font-size:11pt;font-weight:normal;font-style:normal;text-decoration:none;color:rgba(0, 0, 0, 1.00);background-color:transparent;}.cl-e0e95562{margin:0;text-align:right;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);padding-bottom:5pt;padding-top:5pt;padding-left:5pt;padding-right:5pt;line-height: 1;background-color:transparent;}.cl-e0e96336{width:0.568in;background-color:transparent;vertical-align: middle;border-bottom: 1.5pt solid rgba(102, 102, 102, 1.00);border-top: 1.5pt solid rgba(102, 102, 102, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e0e96337{width:0.568in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e0e96354{width:0.568in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e0e9635e{width:0.568in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e0e9635f{width:0.568in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e0e96360{width:0.568in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e0e96368{width:0.568in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e0e96369{width:0.568in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e0e9636a{width:0.568in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e0e96372{width:0.568in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e0e96373{width:0.568in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e0e9637c{width:0.568in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e0e9637d{width:0.568in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e0e96386{width:0.568in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e0e96387{width:0.568in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e0e96390{width:0.568in;background-color:transparent;vertical-align: middle;border-bottom: 1.5pt solid rgba(102, 102, 102, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}</style><table data-quarto-disable-processing='true' class='cl-e0eeea0e'>

```

<caption style="display:table-caption;">(\#tab:failedcallratehet)<span>Failed samples due to sample call rates and heterozygosity rate.</span></caption>

```{=html}

<thead><tr style="overflow-wrap:break-word;"><th class="cl-e0e96336"><p class="cl-e0e95562"><span class="cl-e0e770d0">FID</span></p></th><th class="cl-e0e96336"><p class="cl-e0e95562"><span class="cl-e0e770d0">IID</span></p></th></tr></thead><tbody><tr style="overflow-wrap:break-word;"><td class="cl-e0e96337"><p class="cl-e0e95562"><span class="cl-e0e770d0">28</span></p></td><td class="cl-e0e96337"><p class="cl-e0e95562"><span class="cl-e0e770d0">28</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e96354"><p class="cl-e0e95562"><span class="cl-e0e770d0">50</span></p></td><td class="cl-e0e96354"><p class="cl-e0e95562"><span class="cl-e0e770d0">50</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e9635e"><p class="cl-e0e95562"><span class="cl-e0e770d0">52</span></p></td><td class="cl-e0e9635e"><p class="cl-e0e95562"><span class="cl-e0e770d0">52</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e9635e"><p class="cl-e0e95562"><span class="cl-e0e770d0">53</span></p></td><td class="cl-e0e9635e"><p class="cl-e0e95562"><span class="cl-e0e770d0">53</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e9635f"><p class="cl-e0e95562"><span class="cl-e0e770d0">67</span></p></td><td class="cl-e0e9635f"><p class="cl-e0e95562"><span class="cl-e0e770d0">67</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e96337"><p class="cl-e0e95562"><span class="cl-e0e770d0">96</span></p></td><td class="cl-e0e96337"><p class="cl-e0e95562"><span class="cl-e0e770d0">96</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e96360"><p class="cl-e0e95562"><span class="cl-e0e770d0">205</span></p></td><td class="cl-e0e96360"><p class="cl-e0e95562"><span class="cl-e0e770d0">205</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e96337"><p class="cl-e0e95562"><span class="cl-e0e770d0">298</span></p></td><td class="cl-e0e96337"><p class="cl-e0e95562"><span class="cl-e0e770d0">298</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e9635e"><p class="cl-e0e95562"><span class="cl-e0e770d0">310</span></p></td><td class="cl-e0e9635e"><p class="cl-e0e95562"><span class="cl-e0e770d0">310</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e9635e"><p class="cl-e0e95562"><span class="cl-e0e770d0">351</span></p></td><td class="cl-e0e9635e"><p class="cl-e0e95562"><span class="cl-e0e770d0">351</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e9635f"><p class="cl-e0e95562"><span class="cl-e0e770d0">404</span></p></td><td class="cl-e0e9635f"><p class="cl-e0e95562"><span class="cl-e0e770d0">404</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e96368"><p class="cl-e0e95562"><span class="cl-e0e770d0">418</span></p></td><td class="cl-e0e96368"><p class="cl-e0e95562"><span class="cl-e0e770d0">418</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e96360"><p class="cl-e0e95562"><span class="cl-e0e770d0">423</span></p></td><td class="cl-e0e96360"><p class="cl-e0e95562"><span class="cl-e0e770d0">423</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e96337"><p class="cl-e0e95562"><span class="cl-e0e770d0">428</span></p></td><td class="cl-e0e96337"><p class="cl-e0e95562"><span class="cl-e0e770d0">428</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e96369"><p class="cl-e0e95562"><span class="cl-e0e770d0">442</span></p></td><td class="cl-e0e96369"><p class="cl-e0e95562"><span class="cl-e0e770d0">442</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e9636a"><p class="cl-e0e95562"><span class="cl-e0e770d0">444</span></p></td><td class="cl-e0e9636a"><p class="cl-e0e95562"><span class="cl-e0e770d0">444</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e96368"><p class="cl-e0e95562"><span class="cl-e0e770d0">548</span></p></td><td class="cl-e0e96368"><p class="cl-e0e95562"><span class="cl-e0e770d0">548</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e96369"><p class="cl-e0e95562"><span class="cl-e0e770d0">772</span></p></td><td class="cl-e0e96369"><p class="cl-e0e95562"><span class="cl-e0e770d0">772</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e96368"><p class="cl-e0e95562"><span class="cl-e0e770d0">835</span></p></td><td class="cl-e0e96368"><p class="cl-e0e95562"><span class="cl-e0e770d0">835</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e96368"><p class="cl-e0e95562"><span class="cl-e0e770d0">850</span></p></td><td class="cl-e0e96368"><p class="cl-e0e95562"><span class="cl-e0e770d0">850</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e96368"><p class="cl-e0e95562"><span class="cl-e0e770d0">853</span></p></td><td class="cl-e0e96368"><p class="cl-e0e95562"><span class="cl-e0e770d0">853</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e96372"><p class="cl-e0e95562"><span class="cl-e0e770d0">910</span></p></td><td class="cl-e0e96372"><p class="cl-e0e95562"><span class="cl-e0e770d0">910</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e96372"><p class="cl-e0e95562"><span class="cl-e0e770d0">939</span></p></td><td class="cl-e0e96372"><p class="cl-e0e95562"><span class="cl-e0e770d0">939</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e96337"><p class="cl-e0e95562"><span class="cl-e0e770d0">966</span></p></td><td class="cl-e0e96337"><p class="cl-e0e95562"><span class="cl-e0e770d0">966</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e96373"><p class="cl-e0e95562"><span class="cl-e0e770d0">968</span></p></td><td class="cl-e0e96373"><p class="cl-e0e95562"><span class="cl-e0e770d0">968</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e9637c"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,003</span></p></td><td class="cl-e0e9637c"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,003</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e9637d"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,006</span></p></td><td class="cl-e0e9637d"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,006</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e96386"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,045</span></p></td><td class="cl-e0e96386"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,045</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e96386"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,058</span></p></td><td class="cl-e0e96386"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,058</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e96386"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,154</span></p></td><td class="cl-e0e96386"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,154</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e9637d"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,236</span></p></td><td class="cl-e0e9637d"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,236</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e96387"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,294</span></p></td><td class="cl-e0e96387"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,294</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e9637c"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,395</span></p></td><td class="cl-e0e9637c"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,395</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e9637c"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,537</span></p></td><td class="cl-e0e9637c"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,537</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e96386"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,554</span></p></td><td class="cl-e0e96386"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,554</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e96386"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,587</span></p></td><td class="cl-e0e96386"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,587</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e9637d"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,694</span></p></td><td class="cl-e0e9637d"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,694</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e96386"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,789</span></p></td><td class="cl-e0e96386"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,789</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e96387"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,832</span></p></td><td class="cl-e0e96387"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,832</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e9637d"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,866</span></p></td><td class="cl-e0e9637d"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,866</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e96386"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,904</span></p></td><td class="cl-e0e96386"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,904</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e96387"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,920</span></p></td><td class="cl-e0e96387"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,920</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e9637c"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,951</span></p></td><td class="cl-e0e9637c"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,951</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e96387"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,952</span></p></td><td class="cl-e0e96387"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,952</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e9637c"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,953</span></p></td><td class="cl-e0e9637c"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,953</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e96386"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,954</span></p></td><td class="cl-e0e96386"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,954</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e9637c"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,955</span></p></td><td class="cl-e0e9637c"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,955</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e96386"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,981</span></p></td><td class="cl-e0e96386"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,981</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e96386"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,984</span></p></td><td class="cl-e0e96386"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,984</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e96386"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,985</span></p></td><td class="cl-e0e96386"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,985</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e0e96390"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,986</span></p></td><td class="cl-e0e96390"><p class="cl-e0e95562"><span class="cl-e0e770d0">1,986</span></p></td></tr></tbody></table></div>
```


## Relatedness
<!-- https://rpubs.com/EAVWing/symbols -->
We calculate Identity-by-Descent (IBS) to identify duplicated and related samples. In Table \@ref(tab:showrelatedness) we show how much DNA is shared between individuals depending on their relation[@staples2014]. IBS is measured by calculating pi-hat ($\widehat{\pi}$), which is in essence the proportion of the DNA that a pair of samples share. To calculate this, we needed this ultraclean dataset, without low-quality SNPs and without high-LD regions. 





```{=html}
<div class="tabwid"><style>.cl-e1164054{}.cl-e1106710{font-family:'Helvetica';font-size:11pt;font-weight:normal;font-style:normal;text-decoration:none;color:rgba(0, 0, 0, 1.00);background-color:transparent;}.cl-e1129b8e{margin:0;text-align:left;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);padding-bottom:5pt;padding-top:5pt;padding-left:5pt;padding-right:5pt;line-height: 1;background-color:transparent;}.cl-e1129b8f{margin:0;text-align:right;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);padding-bottom:5pt;padding-top:5pt;padding-left:5pt;padding-right:5pt;line-height: 1;background-color:transparent;}.cl-e112ac3c{width:4.109in;background-color:transparent;vertical-align: middle;border-bottom: 1.5pt solid rgba(102, 102, 102, 1.00);border-top: 1.5pt solid rgba(102, 102, 102, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112ac3d{width:1.23in;background-color:transparent;vertical-align: middle;border-bottom: 1.5pt solid rgba(102, 102, 102, 1.00);border-top: 1.5pt solid rgba(102, 102, 102, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112ac46{width:0.737in;background-color:transparent;vertical-align: middle;border-bottom: 1.5pt solid rgba(102, 102, 102, 1.00);border-top: 1.5pt solid rgba(102, 102, 102, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112ac47{width:0.805in;background-color:transparent;vertical-align: middle;border-bottom: 1.5pt solid rgba(102, 102, 102, 1.00);border-top: 1.5pt solid rgba(102, 102, 102, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112ac50{width:0.525in;background-color:transparent;vertical-align: middle;border-bottom: 1.5pt solid rgba(102, 102, 102, 1.00);border-top: 1.5pt solid rgba(102, 102, 102, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112ac51{width:4.109in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112ac5a{width:1.23in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112ac5b{width:0.737in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112ac64{width:0.805in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112ac65{width:0.525in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112ac6e{width:4.109in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112ac6f{width:1.23in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112ac78{width:0.737in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112ac79{width:0.805in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112ac82{width:0.525in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112ac83{width:4.109in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112ac8c{width:1.23in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112ac8d{width:0.737in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112ac96{width:0.805in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112ac97{width:0.525in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112ac98{width:4.109in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112aca0{width:1.23in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112aca1{width:0.737in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112acaa{width:0.805in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112acab{width:0.525in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112acb4{width:4.109in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112acb5{width:1.23in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112acb6{width:0.737in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112acbe{width:0.805in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112acbf{width:0.525in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112acc0{width:4.109in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112acc8{width:1.23in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112acd2{width:0.737in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112acd3{width:0.805in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112acd4{width:0.525in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112acdc{width:4.109in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112acdd{width:1.23in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112ace6{width:0.737in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112ace7{width:0.805in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112ace8{width:0.525in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112acf0{width:4.109in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112acf1{width:1.23in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112acfa{width:0.737in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112acfb{width:0.805in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112acfc{width:0.525in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112ad04{width:4.109in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112ad05{width:1.23in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112ad0e{width:0.737in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112ad0f{width:0.805in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112ad18{width:0.525in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112ad19{width:4.109in;background-color:transparent;vertical-align: middle;border-bottom: 1.5pt solid rgba(102, 102, 102, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112ad22{width:1.23in;background-color:transparent;vertical-align: middle;border-bottom: 1.5pt solid rgba(102, 102, 102, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112ad23{width:0.737in;background-color:transparent;vertical-align: middle;border-bottom: 1.5pt solid rgba(102, 102, 102, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112ad2c{width:0.805in;background-color:transparent;vertical-align: middle;border-bottom: 1.5pt solid rgba(102, 102, 102, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e112ad2d{width:0.525in;background-color:transparent;vertical-align: middle;border-bottom: 1.5pt solid rgba(102, 102, 102, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}</style><table data-quarto-disable-processing='true' class='cl-e1164054'>

```

<caption style="display:table-caption;">(\#tab:showrelatedness)<span>Familial relations and % DNA shared.</span></caption>

```{=html}

<thead><tr style="overflow-wrap:break-word;"><th class="cl-e112ac3c"><p class="cl-e1129b8e"><span class="cl-e1106710">Relatedness</span></p></th><th class="cl-e112ac3d"><p class="cl-e1129b8e"><span class="cl-e1106710">%.DNA.sharing</span></p></th><th class="cl-e112ac46"><p class="cl-e1129b8e"><span class="cl-e1106710">IBD0</span></p></th><th class="cl-e112ac47"><p class="cl-e1129b8e"><span class="cl-e1106710">IBD1</span></p></th><th class="cl-e112ac50"><p class="cl-e1129b8f"><span class="cl-e1106710">IBD2</span></p></th></tr></thead><tbody><tr style="overflow-wrap:break-word;"><td class="cl-e112ac51"><p class="cl-e1129b8e"><span class="cl-e1106710">Monozygotic twins          </span></p></td><td class="cl-e112ac5a"><p class="cl-e1129b8e"><span class="cl-e1106710"> ±100%</span></p></td><td class="cl-e112ac5b"><p class="cl-e1129b8e"><span class="cl-e1106710">0</span></p></td><td class="cl-e112ac64"><p class="cl-e1129b8e"><span class="cl-e1106710">1</span></p></td><td class="cl-e112ac65"><p class="cl-e1129b8f"><span class="cl-e1106710">0.00</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e112ac6e"><p class="cl-e1129b8e"><span class="cl-e1106710">Parents/child              </span></p></td><td class="cl-e112ac6f"><p class="cl-e1129b8e"><span class="cl-e1106710"> ±50%</span></p></td><td class="cl-e112ac78"><p class="cl-e1129b8e"><span class="cl-e1106710">0.25</span></p></td><td class="cl-e112ac79"><p class="cl-e1129b8e"><span class="cl-e1106710">0.5</span></p></td><td class="cl-e112ac82"><p class="cl-e1129b8f"><span class="cl-e1106710">0.25</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e112ac83"><p class="cl-e1129b8e"><span class="cl-e1106710">Sibling                    </span></p></td><td class="cl-e112ac8c"><p class="cl-e1129b8e"><span class="cl-e1106710"> ±50%</span></p></td><td class="cl-e112ac8d"><p class="cl-e1129b8e"><span class="cl-e1106710">0.25</span></p></td><td class="cl-e112ac96"><p class="cl-e1129b8e"><span class="cl-e1106710">0.5</span></p></td><td class="cl-e112ac97"><p class="cl-e1129b8f"><span class="cl-e1106710">0.25</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e112ac98"><p class="cl-e1129b8e"><span class="cl-e1106710">Fraternal twins            </span></p></td><td class="cl-e112aca0"><p class="cl-e1129b8e"><span class="cl-e1106710"> ±50%</span></p></td><td class="cl-e112aca1"><p class="cl-e1129b8e"><span class="cl-e1106710">0.25</span></p></td><td class="cl-e112acaa"><p class="cl-e1129b8e"><span class="cl-e1106710">0.5</span></p></td><td class="cl-e112acab"><p class="cl-e1129b8f"><span class="cl-e1106710">0.25</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e112ac83"><p class="cl-e1129b8e"><span class="cl-e1106710">Grandparent/grandchild     </span></p></td><td class="cl-e112ac8c"><p class="cl-e1129b8e"><span class="cl-e1106710"> ±25%</span></p></td><td class="cl-e112ac8d"><p class="cl-e1129b8e"><span class="cl-e1106710">0.5</span></p></td><td class="cl-e112ac96"><p class="cl-e1129b8e"><span class="cl-e1106710">0.5</span></p></td><td class="cl-e112ac97"><p class="cl-e1129b8f"><span class="cl-e1106710">0.00</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e112acb4"><p class="cl-e1129b8e"><span class="cl-e1106710">Aunt/Uncle/Niece/Nephew    </span></p></td><td class="cl-e112acb5"><p class="cl-e1129b8e"><span class="cl-e1106710"> ±25%</span></p></td><td class="cl-e112acb6"><p class="cl-e1129b8e"><span class="cl-e1106710">0.5</span></p></td><td class="cl-e112acbe"><p class="cl-e1129b8e"><span class="cl-e1106710">0.5</span></p></td><td class="cl-e112acbf"><p class="cl-e1129b8f"><span class="cl-e1106710">0.00</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e112acc0"><p class="cl-e1129b8e"><span class="cl-e1106710">Half-sibling               </span></p></td><td class="cl-e112acc8"><p class="cl-e1129b8e"><span class="cl-e1106710"> ±25%</span></p></td><td class="cl-e112acd2"><p class="cl-e1129b8e"><span class="cl-e1106710">0.5</span></p></td><td class="cl-e112acd3"><p class="cl-e1129b8e"><span class="cl-e1106710">0.5</span></p></td><td class="cl-e112acd4"><p class="cl-e1129b8f"><span class="cl-e1106710">0.00</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e112ac98"><p class="cl-e1129b8e"><span class="cl-e1106710">First-cousin               </span></p></td><td class="cl-e112aca0"><p class="cl-e1129b8e"><span class="cl-e1106710"> ±12.5%</span></p></td><td class="cl-e112aca1"><p class="cl-e1129b8e"><span class="cl-e1106710">0.75</span></p></td><td class="cl-e112acaa"><p class="cl-e1129b8e"><span class="cl-e1106710">0.25</span></p></td><td class="cl-e112acab"><p class="cl-e1129b8f"><span class="cl-e1106710">0.00</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e112acdc"><p class="cl-e1129b8e"><span class="cl-e1106710">Half first-cousin          </span></p></td><td class="cl-e112acdd"><p class="cl-e1129b8e"><span class="cl-e1106710"> ±6.25%</span></p></td><td class="cl-e112ace6"><p class="cl-e1129b8e"><span class="cl-e1106710">0.875</span></p></td><td class="cl-e112ace7"><p class="cl-e1129b8e"><span class="cl-e1106710">0.125</span></p></td><td class="cl-e112ace8"><p class="cl-e1129b8f"><span class="cl-e1106710">0.00</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e112ac6e"><p class="cl-e1129b8e"><span class="cl-e1106710">First-cousin once removed  </span></p></td><td class="cl-e112ac6f"><p class="cl-e1129b8e"><span class="cl-e1106710"> ±6.25%</span></p></td><td class="cl-e112ac78"><p class="cl-e1129b8e"><span class="cl-e1106710">0.875</span></p></td><td class="cl-e112ac79"><p class="cl-e1129b8e"><span class="cl-e1106710">0.125</span></p></td><td class="cl-e112ac82"><p class="cl-e1129b8f"><span class="cl-e1106710">0.00</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e112acf0"><p class="cl-e1129b8e"><span class="cl-e1106710">Second-cousin              </span></p></td><td class="cl-e112acf1"><p class="cl-e1129b8e"><span class="cl-e1106710"> ±3.13%</span></p></td><td class="cl-e112acfa"><p class="cl-e1129b8e"><span class="cl-e1106710">0.9375</span></p></td><td class="cl-e112acfb"><p class="cl-e1129b8e"><span class="cl-e1106710">6.25E-2</span></p></td><td class="cl-e112acfc"><p class="cl-e1129b8f"><span class="cl-e1106710">0.00</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e112acf0"><p class="cl-e1129b8e"><span class="cl-e1106710">Second-cousin once removed </span></p></td><td class="cl-e112acf1"><p class="cl-e1129b8e"><span class="cl-e1106710"> ±1.56%</span></p></td><td class="cl-e112acfa"><p class="cl-e1129b8e"><span class="cl-e1106710">0.96875</span></p></td><td class="cl-e112acfb"><p class="cl-e1129b8e"><span class="cl-e1106710">3.125E-2</span></p></td><td class="cl-e112acfc"><p class="cl-e1129b8f"><span class="cl-e1106710">0.00</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e112ad04"><p class="cl-e1129b8e"><span class="cl-e1106710">Distantly related</span></p></td><td class="cl-e112ad05"><p class="cl-e1129b8e"><span class="cl-e1106710">&lt;1.56%</span></p></td><td class="cl-e112ad0e"><p class="cl-e1129b8e"><span class="cl-e1106710">varies</span></p></td><td class="cl-e112ad0f"><p class="cl-e1129b8e"><span class="cl-e1106710">varies</span></p></td><td class="cl-e112ad18"><p class="cl-e1129b8f"><span class="cl-e1106710">0.00</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e112ad19"><p class="cl-e1129b8e"><span class="cl-e1106710">Unrelated (includes relationships beyond the third degree)</span></p></td><td class="cl-e112ad22"><p class="cl-e1129b8e"><span class="cl-e1106710">&lt;1.56%</span></p></td><td class="cl-e112ad23"><p class="cl-e1129b8e"><span class="cl-e1106710">1</span></p></td><td class="cl-e112ad2c"><p class="cl-e1129b8e"><span class="cl-e1106710">0</span></p></td><td class="cl-e112ad2d"><p class="cl-e1129b8f"><span class="cl-e1106710">0.00</span></p></td></tr></tbody></table></div>
```

`PLINK` calculates the inter-individual relatedness using the `--genome` function.

```
plink --bfile dummy_project/rawdata.clean.ultraclean --genome --out dummy_project/rawdata.clean.ultraclean
```

We can now identify all pairs of individuals with an IBD > 0.185. The code looks at the individual call rates stored in `rawdata.imiss` and outputs the IDs of the individual with the lowest call rate to ‘fail-IBD-QC.txt’ for subsequent removal (Table \@ref(tab:showibdcallissues)).

First, move to the `dummy_project` directory.

```
cd dummy_project
```

Now, execute this script - it should work just fine out-of-the-box.

```
perl ../scripts/run-IBD-QC.pl rawdata rawdata.clean.ultraclean
```

Go back one directory.

```
cd ..
```






```{=html}
<div class="tabwid"><style>.cl-e129072a{}.cl-e1247f02{font-family:'Helvetica';font-size:11pt;font-weight:normal;font-style:normal;text-decoration:none;color:rgba(0, 0, 0, 1.00);background-color:transparent;}.cl-e1264df0{margin:0;text-align:right;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);padding-bottom:5pt;padding-top:5pt;padding-left:5pt;padding-right:5pt;line-height: 1;background-color:transparent;}.cl-e1265aac{width:0.568in;background-color:transparent;vertical-align: middle;border-bottom: 1.5pt solid rgba(102, 102, 102, 1.00);border-top: 1.5pt solid rgba(102, 102, 102, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e1265ab6{width:0.568in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e1265ac0{width:0.568in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e1265ac1{width:0.568in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e1265ac2{width:0.568in;background-color:transparent;vertical-align: middle;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-e1265aca{width:0.568in;background-color:transparent;vertical-align: middle;border-bottom: 1.5pt solid rgba(102, 102, 102, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}</style><table data-quarto-disable-processing='true' class='cl-e129072a'>

```

<caption style="display:table-caption;">(\#tab:showibdcallissues)<span>Failed IBD and callrate.</span></caption>

```{=html}

<thead><tr style="overflow-wrap:break-word;"><th class="cl-e1265aac"><p class="cl-e1264df0"><span class="cl-e1247f02">FID</span></p></th><th class="cl-e1265aac"><p class="cl-e1264df0"><span class="cl-e1247f02">IID</span></p></th></tr></thead><tbody><tr style="overflow-wrap:break-word;"><td class="cl-e1265ab6"><p class="cl-e1264df0"><span class="cl-e1247f02">1,952</span></p></td><td class="cl-e1265ab6"><p class="cl-e1264df0"><span class="cl-e1247f02">1,952</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e1265ac0"><p class="cl-e1264df0"><span class="cl-e1247f02">1,953</span></p></td><td class="cl-e1265ac0"><p class="cl-e1264df0"><span class="cl-e1247f02">1,953</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e1265ac1"><p class="cl-e1264df0"><span class="cl-e1247f02">1,954</span></p></td><td class="cl-e1265ac1"><p class="cl-e1264df0"><span class="cl-e1247f02">1,954</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e1265ac0"><p class="cl-e1264df0"><span class="cl-e1247f02">1,955</span></p></td><td class="cl-e1265ac0"><p class="cl-e1264df0"><span class="cl-e1247f02">1,955</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e1265ac0"><p class="cl-e1264df0"><span class="cl-e1247f02">1,957</span></p></td><td class="cl-e1265ac0"><p class="cl-e1264df0"><span class="cl-e1247f02">1,957</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e1265ac0"><p class="cl-e1264df0"><span class="cl-e1247f02">1,959</span></p></td><td class="cl-e1265ac0"><p class="cl-e1264df0"><span class="cl-e1247f02">1,959</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e1265ac2"><p class="cl-e1264df0"><span class="cl-e1247f02">1,961</span></p></td><td class="cl-e1265ac2"><p class="cl-e1264df0"><span class="cl-e1247f02">1,961</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e1265ac2"><p class="cl-e1264df0"><span class="cl-e1247f02">1,963</span></p></td><td class="cl-e1265ac2"><p class="cl-e1264df0"><span class="cl-e1247f02">1,963</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e1265ac2"><p class="cl-e1264df0"><span class="cl-e1247f02">1,965</span></p></td><td class="cl-e1265ac2"><p class="cl-e1264df0"><span class="cl-e1247f02">1,965</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e1265ac2"><p class="cl-e1264df0"><span class="cl-e1247f02">1,967</span></p></td><td class="cl-e1265ac2"><p class="cl-e1264df0"><span class="cl-e1247f02">1,967</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e1265ac2"><p class="cl-e1264df0"><span class="cl-e1247f02">1,969</span></p></td><td class="cl-e1265ac2"><p class="cl-e1264df0"><span class="cl-e1247f02">1,969</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e1265ac0"><p class="cl-e1264df0"><span class="cl-e1247f02">1,971</span></p></td><td class="cl-e1265ac0"><p class="cl-e1264df0"><span class="cl-e1247f02">1,971</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e1265ac0"><p class="cl-e1264df0"><span class="cl-e1247f02">1,973</span></p></td><td class="cl-e1265ac0"><p class="cl-e1264df0"><span class="cl-e1247f02">1,973</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-e1265aca"><p class="cl-e1264df0"><span class="cl-e1247f02">1,975</span></p></td><td class="cl-e1265aca"><p class="cl-e1264df0"><span class="cl-e1247f02">1,975</span></p></td></tr></tbody></table></div>
```

## Ancestral background

Using a **Principal Component Analysis (PCA)** we can reduce the dimensions of the data, and project the "ancestral distances". In other words, the principal component 1 (the first dimension) and principal component 2 (the second dimension) which will capture most of the variation in the data and represent how much each sample is alike the next. And when compared to a reference, you can deduce the ancestral background of each sample in your dataset. Of course this is relative: we will only know that a given sample is very much a like samples from a given population that _exists today_. 

Nowadays we run such PCA against a large and diverse dataset containing many different populations. Old-school GWAS (pre-2009) would compare a dataset against [HapMap 3](https://www.broadinstitute.org/medical-and-population-genetics/hapmap-3){target="_blank"}, nowadays we prefer at a minimum the [1000G phase 3 populations](https://www.internationalgenome.org){target="_blank"}. And in those ancient times the preferred software to run a PCA was _Eigensoft_ which is a bit tricky to install (see Chapter \@ref(eigensoft)), but nowadays `PLINK` provides the `--pca`-flag. 

For the purpose of this practical primer we will run PCA using `PLINK` and its `--pca`-flag against an earlier version of 1000G, phase 1, which is slightly smaller and just as good to use. 

### 1000G phase 1

We will project our data to the reference, in this example 1000G phase 1 (1000G), which includes individuals from 14 distinct global populations across 4 'super'-populations (Europeans [`EUR`], Africans [`AFR`], East-Asians [`EAS`], and Latin Americans [`AMR`]). In the real-world, using phase 1 may be just fine, but if you think your population evolved through extensive migration it's probably best to use phase 3 data. In other words, the choice of reference is really depending on the dataset.

First, we will merge our data with 1000G. The alleles at each marker must be aligned to the same DNA strand to allow our data to merge correctly. Because not all SNPs are required for this analysis the A->T and C->G SNPs, which are more difficult to align, can be omitted.

#### Filter the 1000G data

First, we should get a list of relevant variants from our `rawdata`-dataset. We don't need the other variants present in the 1000G dataset, right?

```
cat dummy_project/rawdata.bim | grep "rs" > dummy_project/all.variants.txt
```

Extract those from the 1000G phase 1 data.

```
plink --bfile ref_1kg_phase1_all/1kg_phase1_all --extract dummy_project/all.variants.txt --make-bed --out ref_1kg_phase1_all/1kg_phase1_raw
```

#### Filter A/T & C/G SNPs

As explained, the A/T and C/G SNPs are problematic, we want to exclude these too. So let's get a list of A/T and C/G variants from 1000G to exclude - this may take a while.

```
cat ref_1kg_phase1_all/1kg_phase1_raw.bim | \
awk '($5 == "A" && $6 == "T") || ($5 == "T" && $6 == "A") || ($5 == "C" && $6 == "G") || ($5 == "G" && $6 == "C")' | awk '{ print $2, $1, $4, $3, $5, $6 }' \
> ref_1kg_phase1_all/all.1kg.atcg.variants.txt
```

Exclude those A/T and C/G variants in both datasets and at the same time filter to only retain high-quality data and exclude non-autosomal variants. 

```
plink --bfile ref_1kg_phase1_all/1kg_phase1_raw --exclude ref_1kg_phase1_all/all.1kg.atcg.variants.txt -make-bed --out ref_1kg_phase1_all/1kg_phase1_raw_no_atcg

plink --bfile dummy_project/rawdata --exclude ref_1kg_phase1_all/all.1kg.atcg.variants.txt --make-bed --out dummy_project/rawdata_1kg_phase1_raw_no_atcg
```

#### Merging datasets

Try and merge the data.

```
plink --bfile dummy_project/rawdata_1kg_phase1_raw_no_atcg --bmerge ref_1kg_phase1_all/1kg_phase1_raw_no_atcg --make-bed --out dummy_project/rawdata.1kg_phase1
```

There probably is an error ...

```
Error: 72 variants with 3+ alleles present.
* If you believe this is due to strand inconsistency, try --flip with
  dummy_project/rawdata.1kg_phase1-merge.missnp.
  (Warning: if the subsequent merge seems to work, strand errors involving SNPs
  with A/T or C/G alleles probably remain in your data.  If LD between nearby
  SNPs is high, --flip-scan should detect them.)
* If you are dealing with genuine multiallelic variants, we recommend exporting
  that subset of the data to VCF (via e.g. '--recode vcf'), merging with
  another tool/script, and then importing the result; PLINK is not yet suited
  to handling them.
See https://www.cog-genomics.org/plink/1.9/data#merge3 for more discussion.
```

So let's flip some variants.

```
plink --bfile dummy_project/rawdata --exclude ref_1kg_phase1_all/all.1kg.atcg.variants.txt --flip dummy_project/rawdata.1kg_phase1-merge.missnp --make-bed --out dummy_project/rawdata_1kg_phase1_raw_no_atcg
```

Let's try again and merge the data.

```
plink --bfile dummy_project/rawdata_1kg_phase1_raw_no_atcg --bmerge ref_1kg_phase1_all/1kg_phase1_raw_no_atcg --make-bed --out dummy_project/rawdata.1kg_phase1
```

There still is an error -- there are a few multi-allelic variants present which `PLINK` can't handle. 

```
Error: 14 variants with 3+ alleles present.
* If you believe this is due to strand inconsistency, try --flip with
  dummy_project/rawdata.1kg_phase1-merge.missnp.
  (Warning: if the subsequent merge seems to work, strand errors involving SNPs
  with A/T or C/G alleles probably remain in your data.  If LD between nearby
  SNPs is high, --flip-scan should detect them.)
* If you are dealing with genuine multiallelic variants, we recommend exporting
  that subset of the data to VCF (via e.g. '--recode vcf'), merging with
  another tool/script, and then importing the result; PLINK is not yet suited
  to handling them.
See https://www.cog-genomics.org/plink/1.9/data#merge3 for more discussion.
```

Let's just remove these multi-allelic variants.

```
plink --bfile dummy_project/rawdata_1kg_phase1_raw_no_atcg --exclude dummy_project/rawdata.1kg_phase1-merge.missnp --make-bed --out dummy_project/rawdata_1kg_phase1_raw_no_atcg_bi
```

After removing those pesky multi-allelic variants, we should be able to merge the data. We should take of the following:

- extract the pruned SNP-set (remember?), `--extract rawdata/raw-GWA-data.prune.i`, 
- exclude non-autosomal variants, `--autosome`, 
- and only keeping high-quality data, `--maf 0.10 --geno 0.10 --hwe 1e-3`, 
<!-- THIS CHUNK COULD BE USED TO REMOVE DUPLICATES, BUT IT DOESN'T WORK IN PLINK 1.9 -->
<!-- - remove non-founders from the 1000G data, `--keep-founders`, -->
<!-- - remove any duplicate variants based on SNPID, `--rm-dup force-first` -->
<!-- - and list any duplicate variants based on chromosomal base pair position, `--list-duplicate-vars` -->

```
plink --bfile dummy_project/rawdata_1kg_phase1_raw_no_atcg_bi \
--bmerge ref_1kg_phase1_all/1kg_phase1_raw_no_atcg \
--autosome \
--maf 0.10 --geno 0.10 --hwe 1e-3 \
--extract rawdata/raw-GWA-data.prune.in \
--make-bed --out dummy_project/rawdata.1kg_phase1.clean
```

<!-- THIS CHUNK COULD BE USED TO REMOVE DUPLICATES, BUT IT DOESN'T WORK IN PLINK 1.9 -->
<!-- ``` -->
<!-- plink --bfile dummy_project/rawdata_1kg_phase1_raw_no_atcg_bi \ -->
<!-- --bmerge ref_1kg_phase1_all/1kg_phase1_raw_no_atcg \ -->
<!-- --autosome \ -->
<!-- --maf 0.10 --geno 0.10 --hwe 1e-3 \ -->
<!-- --extract rawdata/raw-GWA-data.prune.in \ -->
<!-- --list-duplicate-vars \ -->
<!-- --make-bed --out dummy_project/rawdata.1kg_phase1.pruned -->
<!-- ``` -->

<!-- ``` -->
<!-- plink --bfile dummy_project/rawdata.1kg_phase1.pruned \ -->
<!-- --exclude dummy_project/rawdata.1kg_phase1.pruned.dupvar \ -->
<!-- --rm-dup force-first \ -->
<!-- --make-bed --out dummy_project/rawdata.1kg_phase1.clean -->
<!-- ``` -->

Before we continue it's best to clean up a bit of the mess. 

```
rm -fv dummy_project/rawdata_1kg_phase1_raw_no_atcg_bi* ref_1kg_phase1_all/1kg_phase1_raw_no_atcg* dummy_project/rawdata.1kg_phase1.pruned* dummy_project/rawdata_1kg_phase1_raw_no_atcg* dummy_project/rawdata.1kg_phase1-merge* dummy_project/rawdata.1kg_phase1.bed dummy_project/rawdata.1kg_phase1.bim dummy_project/rawdata.1kg_phase1.fam dummy_project/rawdata.1kg_phase1.hh dummy_project/rawdata.1kg_phase1.log ref_1kg_phase1_all/1kg_phase1_raw.*
```

Now we have prepared our dataset with only high-quality SNPs that have few missing data, that are high-frequent, exclude problematic genomic ranges, and merged to the 1000G phase 1 reference dataset. Your output should look something like this:

```
PLINK v1.90b7.2 64-bit (11 Dec 2023)           www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to dummy_project/rawdata.1kg_phase1.clean.log.
Options in effect:
  --autosome
  --bfile dummy_project/rawdata_1kg_phase1_raw_no_atcg_bi
  --bmerge ref_1kg_phase1_all/1kg_phase1_raw_no_atcg
  --extract rawdata/raw-GWA-data.prune.in
  --geno 0.10
  --hwe 1e-3
  --maf 0.10
  --make-bed
  --out dummy_project/rawdata.1kg_phase1.clean

16384 MB RAM detected; reserving 8192 MB for main workspace.
2000 people loaded from dummy_project/rawdata_1kg_phase1_raw_no_atcg_bi.fam.
1092 people to be merged from ref_1kg_phase1_all/1kg_phase1_raw_no_atcg.fam.
Of these, 1092 are new, while 0 are present in the base dataset.
Warning: Multiple positions seen for variant 'rs3934834'.
Warning: Multiple positions seen for variant 'rs3737728'.
Warning: Multiple positions seen for variant 'rs6687776'.
Warning: Multiple chromosomes seen for variant 'rs1050301'.
Warning: Multiple chromosomes seen for variant 'rs4850'.
317476 markers loaded from dummy_project/rawdata_1kg_phase1_raw_no_atcg_bi.bim.
312239 markers to be merged from ref_1kg_phase1_all/1kg_phase1_raw_no_atcg.bim.
Of these, 14 are new, while 312225 are present in the base dataset.
312190 more multiple-position warnings: see log file.
Performing single-pass merge (3092 people, 308317 variants).
Merged fileset written to dummy_project/rawdata.1kg_phase1.clean-merge.bed +
dummy_project/rawdata.1kg_phase1.clean-merge.bim +
dummy_project/rawdata.1kg_phase1.clean-merge.fam .
308317 variants loaded from .bim file.
3092 people (1522 males, 1570 females) loaded from .fam.
3092 phenotype values loaded from .fam.
--extract: 49856 variants remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 3078 founders and 14 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.994867.
299 variants removed due to missing genotype data (--geno).
--hwe: 13825 variants removed due to Hardy-Weinberg exact test.
2617 variants removed due to minor allele threshold(s)
(--maf/--max-maf/--mac/--max-mac).
33115 variants and 3092 people pass filters and QC.
Phenotype data is quantitative.
--make-bed to dummy_project/rawdata.1kg_phase1.clean.bed +
dummy_project/rawdata.1kg_phase1.clean.bim +
dummy_project/rawdata.1kg_phase1.clean.fam ... done.
```

So in total there are 3,092 individuals, 1,522 males and 1,570 females, and 3,078 founders and 14 non-founders. The total genotyping rate is 99.5% and 33,115 variants are present.


### Principal component analysis

Great, we've prepared our dummy project data and merged this with 1000G phase 1. Let's execute the PCA using `--pca` in `PLINK`.

```
plink --bfile dummy_project/rawdata.1kg_phase1.clean --pca --out dummy_project/rawdata.1kg_phase1.clean
```

### Plotting the PCA results

If all is peachy, you just succesfully ran PCA against 1000G phase 1. Using `--pca` we have calculated principal components (PCs), 20 in total by default, and we can now start plotting them. Let's create a scatter diagram of the first two principal components, including all individuals in the file `rawdata.1kg_phase1.clean.eigenvec` (the first and second principal components are columns 3 and 4, respectively). We need to collect some per-sample information to color the points according to sample origin. 

First we collect the results from the `--pca`, the dummy data phenotype information, and the reference population information.












Derive PC1 and PC2 thresholds so that only individuals who match the given ancestral population are included. For populations of European descent, this will be either the CEU or TSI 1000G individuals (Figure \@ref(fig:showpca1kg)). Here, we chose to exclude all individuals with a first principal component score less than `0.0023`. 

Write the FID and IID of these individuals to a file called `fail-ancestry-QC.txt`.

```
cat dummy_project/rawdata.1kg_phase1.clean.eigenvec | \
awk '$3 < 0.0023' | awk '{ print $1, $2 }' > dummy_project/fail-ancestry-QC.txt
```

Choosing which thresholds to apply (and thus which individuals to remove) is not a straightforward process. The key is to remove those individuals with greatly divergent ancestry, as these samples introduce the most bias to the study. Identification of more fine-scale ancestry can be conducted by using less divergent reference samples (_e.g._, within Europe, stratification could be identified using the CEU, TSI (Italian), GBR (British), FIN (Finnish) and IBS (Iberian) samples from the 1,000 Genomes Project (http://www.1000genomes.org/)). Robust identification of fine-scale population structure often requires the construction of many (2–10) principal components.

<div class="figure" style="text-align: center">
<img src="img/_gwas_dummy/gwas-qc-pca-1000g.png" alt="PCA - Your data vs. 1000g." width="85%" />
<p class="caption">(\#fig:showpca1kg)PCA - Your data vs. 1000g.</p>
</div>


## Removing samples

Finally! We have a list of samples of poor quality or divergent ancestry, and duplicated or related samples. We should remove these. Let's collect all IDs from our `fail-*`-files into a single file.

```
cat dummy_project/fail-* | sort -k1 | uniq > dummy_project/fail-qc-inds.txt
```

This new file should now contain a list of unique individuals failing the previous QC steps which we want to remove.

```
plink --bfile dummy_project/rawdata --remove dummy_project/fail-qc-inds.txt --make-bed --out dummy_project/clean_inds_data
```

> Question: How many variants and samples are left? How many cases and how many controls did you loose? 

## The next step

Now that you filtered samples, we should turn our attention to step 2 of the QC for GWAS: identify SNPs of poor quality in Chapter \@ref(gwas_basics_snp_qc).

<!-- ```{js, echo = FALSE} -->
<!-- title=document.getElementById('header'); -->
<!-- title.innerHTML = '<img src="img/_headers/gwas_sample_qc.png" alt="GWAS basics: Sample QC">' + title.innerHTML -->
<!-- ``` -->

<!--chapter:end:03_2_gwas_basics_sample_qc.Rmd-->

# Per-SNP QC {#gwas_basics_snp_qc}
<!-- ![](./img/_gwas_dummy/gwas_snp_qc.png){width=70%} -->






Now that we removed samples, we can focus on low-quality variants.

## SNP call rates

We start by calculating the missing genotype rate for each SNP, in other words the per-SNP call rate.

```
plink --bfile dummy_project/clean_inds_data --missing --out dummy_project/clean_inds_data
```

Let's visualize the results to identify a threshold for extreme genotype failure rate. We chose a callrate threshold of 3%, but it's arbitrary and depending on the dataset, the data (visualization), and the number of samples (Figure \@ref(fig:showsnpcallrate)).






<div class="figure" style="text-align: center">
<img src="img/_gwas_dummy/show-snp-callrate.png" alt="Per SNP call rate." width="85%" />
<p class="caption">(\#fig:showsnpcallrate)Per SNP call rate.</p>
</div>

## Differential SNP call rates

There could also be differences in genotype call rates between cases and controls. It is very important to check for this because these differences could lead to spurious associations. We can test all markers for differences in call rate between cases and controls, or based on other criteria.

```
plink --bfile dummy_project/clean_inds_data --test-missing --out dummy_project/clean_inds_data
```

Let's collect all the SNPs with a significantly different (P < 0.00001) missing data rate between cases and controls.

```
cat dummy_project/clean_inds_data.missing | awk '$5 < 0.00001' | awk '{ print $2 }' > dummy_project/fail-diffmiss-qc.txt
```

## Allele frequencies

We should also get an idea on what the allele frequencies are in our dataset. Low frequent SNPs should probably be excluded, as these are uninformative when monomorphic (allele frequency = 0), or they may lead to spurious associations.

```
plink --bfile dummy_project/clean_inds_data --freq --out dummy_project/clean_inds_data
```

Let's also plot these data. You can view the result below, and try it yourself (Figure \@ref(fig:showfreq)).





<div class="figure" style="text-align: center">
<img src="img/_gwas_dummy/show-freq.png" alt="Minor allele frequency." width="85%" />
<p class="caption">(\#fig:showfreq)Minor allele frequency.</p>
</div>

### A note on allele coding

Oh, one more thing about alleles. 

`PLINK` codes alleles as follows:

A1 = minor allele, the least frequent allele
A2 = major allele, the most frequent allele

And when you use `PLINK` the flag `--freq` or `--maf` is always relative to the A1-allele, as is the odds ratio (OR) or effect size (beta).

However, `SNPTEST` makes use of the so-called OXFORD-format, this codes alleles as follows:

A = the 'other' allele
B = the 'coded' allele

When you use `SNPTEST` it will report the allele frequency as `CAF`, in other words the _coded allele frequency_, and the effect size (beta) is always relative to the B-allele. This means, `CAF` _could_ be the `MAF`, or _minor allele frequency_, but this is **not** a given.

In other words, always make sure what the allele-coding of a given program, be it `PLINK`, `SNPTEST`, `GCTA`, _et cetera_, is! I cannot stress this enough. Ask yourself: 'what is the allele frequency referring to?', 'the effect size is relative to...?'.

Right, let's continue.

## Hardy-Weinberg Equilibrium

Because we are performing a case-control genome-wide association study, we probably expect some differences in Hardy-Weinberg Equilibrium (HWE), but extreme deviations are probably indicative of genotyping errors.

```
plink --bfile dummy_project/clean_inds_data --hardy --out dummy_project/clean_inds_data
```

Let's also plot these data. You can view the result below, and type over the code to do it yourself.






<div class="figure" style="text-align: center">
<img src="img/_gwas_dummy/show-hwe.png" alt="Hardy-Weinberg Equilibrium p-values per stratum." width="85%" />
<p class="caption">(\#fig:showhwe)Hardy-Weinberg Equilibrium p-values per stratum.</p>
</div>

## Final SNP QC

We are ready to perform the final QC. After inspecting the graphs we will filter on a MAF < 0.01, call rate < 0.05, and HWE < 0.00001 (Figure \@ref(fig:showhwe)), in addition those SNPs that failed the differential call rate test will be removed.

```
plink --bfile dummy_project/clean_inds_data --exclude dummy_project/fail-diffmiss-qc.txt --maf 0.01 --geno 0.05 --hwe 0.00001 --make-bed --out dummy_project/cleandata
```

## A Milestone

Congratulations. You reached a very important milestone. Now that you filtered samples and SNPs, we can finally start the association analyses in Chapter \@ref(gwas_testing).

<!-- ```{js, echo = FALSE} -->
<!-- title=document.getElementById('header'); -->
<!-- title.innerHTML = '<img src="img/_headers/gwas_snp_qc.png" alt="GWAS basics: SNP QC">' + title.innerHTML -->
<!-- ``` -->

<!--chapter:end:03_3_gwas_basics_snp_qc.Rmd-->

# Genome-Wide Association Study {#gwas_testing}
<!-- ![](./img/_gwas/interactive_plot.png){width=70%} -->






Now that you have learned how to perform QC, you can easily run a GWAS and execute some downstream visualization and analyses. Let's do this with a dummy dataset.

## Exploring the data

Even though someone says that the QC was done, it is still wise and good practice to run some of the commands above to get a 'feeling' about the data. So let's do this.

```
plink --bfile gwas/gwa --freq --out dummy_project/gwa
```

```
plink --bfile gwas/gwa --missing --out dummy_project/gwa
```

```
plink --bfile gwas/gwa --hardy --out dummy_project/gwa
```

Let's visualize the results. First we should load in all the results.

> Question: Load the data using R. [Hint: use and adapt the examples from the previous chapters.]



We can plot the per-stratum HWE p-values.

> Question: Plot the per-stratum HWE p-values using R. [Hint: use and adapt the examples from the previous chapters.]



<div class="figure" style="text-align: center">
<img src="img/_gwas_dummy/show-hwe-gwas.png" alt="Per stratum HWE p-values." width="85%" />
<p class="caption">(\#fig:unnamed-chunk-2)Per stratum HWE p-values.</p>
</div>

We will want to see what the distribution of allele frequencies looks like. 

> Question: Plot the allele frequencies using R. [Hint: use and adapt the examples from the previous chapters.]



<div class="figure" style="text-align: center">
<img src="img/_gwas_dummy/show-freq-gwas.png" alt="Minor allele frequencies." width="85%" />
<p class="caption">(\#fig:unnamed-chunk-3)Minor allele frequencies.</p>
</div>

We will want to identify samples that have poor call rates.

> Question: Plot the per-sample call rates using R. [Hint: use and adapt the examples from the previous chapters.]



<div class="figure" style="text-align: center">
<img src="img/_gwas_dummy/show-sample-callrate-gwas.png" alt="Per sample call rates." width="85%" />
<p class="caption">(\#fig:unnamed-chunk-4)Per sample call rates.</p>
</div>

We also need to know what the per SNP call rates are.

> Question: Plot the per-SNP call rates using R. [Hint: use and adapt the examples from the previous chapters.]



<div class="figure" style="text-align: center">
<img src="img/_gwas_dummy/show-snp-callrate-gwas.png" alt="Per SNP call rates." width="85%" />
<p class="caption">(\#fig:unnamed-chunk-5)Per SNP call rates.</p>
</div>


## Genetic models
A simple chi-square test of association can be done.

```
plink --bfile gwas/gwa --model --out gwas/data
```

_Genotypic_, _dominant_ and _recessive_ tests will not be conducted if any one of the cells in the table of case-control by genotype counts contains less than five observations. This is because the chi-square approximation may not be reliable when cell counts are small. For SNPs with MAFs < 5%, a sample of more than 2,000 cases and controls would be required to meet this threshold and more than 50,000 would be required for SNPs with MAF < 1%.

You can change this default behaviour by adding the flag `--cell`, _e.g._, we could lower the threshold to 3.

```
plink --bfile gwas/gwa --model --cell 3 --out gwas/data
```

Let's review the contents of the results.



It contains 1,530,510 rows, one for each SNP, and each type of test (_genotypic_, _trend_, _allelic_, _dominant_, and _recessive_) and the following columns:

- chromosome [`CHR`],
- the SNP identifier [`SNP`],
- the minor allele [`A1`] (remember, `PLINK` always codes the A1-allele as the minor allele!),
- the major allele [`A2`],
- the test performed [`TEST`]:
  - `GENO` (genotypic association);
  - `TREND` (Cochran-Armitage trend);
  - `ALLELIC` (allelic as- sociation);
  - `DOM` (dominant model); and
  - `REC` (recessive model)],
- the cell frequency counts for cases [`AFF`], 
- the cell frequency counts for controls [`UNAFF`],
- the chi-square test statistic [`CHISQ`],
- the degrees of freedom for the test [`DF`],
- and the asymptotic P value [`P`] of association.

> Question: Do you know which model, _i.e._ `TEST` is most commonly used and reported? And why is that, do think?

## Logistic regression
We can also perform a test of association using logistic regression. In this case we might want to correct for covariates/confounding factors, for example age, sex, ancestral background, i.e. principal components, and other study specific covariates (e.g. hospital of inclusion, genotyping centre etc.). In that case each of these P values is adjusted for the effect of the covariates.

When running a regression analysis, be it linear or logistic, PLINK assumes a multiplicative model. By default, when at least one male and one female is present, sex (male = 1, female = 0) is automatically added as a covariate on X chromosome SNPs, and nowhere else. The `sex` flag causes it to be added everywhere, while `no-x-sex` excludes it.

```
plink --bfile gwas/gwa --logistic sex --covar gwas/gwa.covar --out gwas/data
```

Let's examine the results.



```
## [1] 918306      9
```


```
##      CHR       SNP      BP     A1   TEST NMISS     OR     STAT      P
##    <int>    <char>   <int> <char> <char> <int>  <num>    <num>  <num>
## 1:     1 rs3934834  995669      T    ADD  3818 1.0290  0.38120 0.7031
## 2:     1 rs3934834  995669      T    AGE  3818 1.0020  1.11800 0.2635
## 3:     1 rs3934834  995669      T    SEX  3818 1.0120  0.19090 0.8486
## 4:     1 rs3737728 1011278      A    ADD  3982 1.0190  0.38670 0.6990
## 5:     1 rs3737728 1011278      A    AGE  3982 1.0020  1.09800 0.2721
## 6:     1 rs3737728 1011278      A    SEX  3982 1.0060  0.09898 0.9212
## 7:     1 rs6687776 1020428      T    ADD  3915 0.9692 -0.33330 0.7389
## 8:     1 rs6687776 1020428      T    AGE  3915 1.0020  1.04000 0.2984
## 9:     1 rs6687776 1020428      T    SEX  3915 1.0150  0.23690 0.8127
```

> Question: How come there are more lines in this file than there are variants?

If no model option is specified, the first row for each SNP corresponds to results for a multiplicative test of association. The C >= 0 subsequent rows for each SNP correspond to separate tests of significance for each of the C covariates included in the regression model. We can remove the covariate-specific lines from the main report by adding the `hide-covar` flag.

The columns in the association results are:

- the chromosome [`CHR`],
- the SNP identifier [`SNP`],
- the base-pair location [`BP`],
- the minor allele [`A1`],
- the test performed [`TEST`]: ADD (multiplicative model or genotypic model testing additivity),
  - `GENO_2DF` (genotypic model),
  - `DOMDEV` (genotypic model testing deviation from additivity),
  - `DOM` (dominant model), or
  - `REC` (recessive model)],
- the number of missing individuals included [`NMISS`],
- the `OR` relative to the A1, _i.e._ minor allele,
- the coefficient z-statistic [`STAT`], and
- the asymptotic P-value [`P`] of association.

We need to calculate the standard error and confidence interval from the z-statistic. We can modify the effect size (`OR`) to output the _beta_ by adding the `beta` flag.

> Question: Can you write down the mathematical relation between _beta_ and _OR_?

## Let's get visual

Looking at numbers is important, but it won't give you a perfect overview. We should turn to visualizing our results in Chapter \@ref(gwas-visuals).

<!-- ```{js, echo = FALSE} -->
<!-- title=document.getElementById('header'); -->
<!-- title.innerHTML = '<img src="img/_headers/interactive_plot.png" alt="GWAS basics: association testing">' + title.innerHTML -->
<!-- ``` -->

<!--chapter:end:03_4_gwas_basics_association_testing.Rmd-->

# GWAS visualisation {#gwas-visuals}
<!-- ![](./img/_headers/interactive_plot.png){width=100%} -->





Data visualization is key, not only for presentation but also to inspect the results.

### QQ plots
We should create _quantile-quantile (QQ) plots_ to compare the observed association test statistics with their expected values under the null hypothesis of no association and so assess the number, magnitude and quality of true associations.

First, we will add the standard error, call rate, A2, and allele frequencies.


```
## Key: <SNP>
##           SNP   CHR        BP     A1 NMISS     OR    STAT       P     A2    MAF
##        <char> <int>     <int> <char> <int>  <num>   <num>   <num> <char>  <num>
## 1: rs10000010     4  21227772      C  3996 1.0420  0.9010 0.36760      T 0.4258
## 2: rs10000023     4  95952929      T  3957 0.9902 -0.2160 0.82900      G 0.4841
## 3: rs10000030     4 103593179      A  3991 0.9779 -0.3696 0.71170      G 0.1616
## 4:  rs1000007     2 237416793      C  4000 1.0180  0.3649 0.71520      T 0.3122
## 5: rs10000092     4  21504615      C  3963 0.9240 -1.6770 0.09354      T 0.3430
## 6: rs10000121     4 157793485      G  3919 0.9665 -0.7525 0.45170      A 0.4532
##    NCHROBS callrate
##      <int>    <num>
## 1:    7992  0.99900
## 2:    7914  0.98925
## 3:    7982  0.99775
## 4:    8000  1.00000
## 5:    7926  0.99075
## 6:    7838  0.97975
```

```
## [1] 306102     14
```

```
## # A tibble: 6 × 14
##   SNP        CHR     BP A1    A2      MAF callrate NMISS NCHROBS     BETA     SE
##   <chr>    <int>  <int> <chr> <chr> <dbl>    <dbl> <int>   <int>    <dbl>  <dbl>
## 1 rs10000…     4 2.12e7 C     T     0.426    0.999  3996    7992  0.0411  0.0457
## 2 rs10000…     4 9.60e7 T     G     0.484    0.989  3957    7914 -0.00985 0.0456
## 3 rs10000…     4 1.04e8 A     G     0.162    0.998  3991    7982 -0.0223  0.0605
## 4 rs10000…     2 2.37e8 C     T     0.312    1      4000    8000  0.0178  0.0489
## 5 rs10000…     4 2.15e7 C     T     0.343    0.991  3963    7926 -0.0790  0.0471
## 6 rs10000…     4 1.58e8 G     A     0.453    0.980  3919    7838 -0.0341  0.0453
## # ℹ 3 more variables: OR <dbl>, STAT <dbl>, P <dbl>
```

Let's list the number of SNPs per chromosome. This gives a pretty good idea about the per-chromosome coverage. And it's a sanity check: did the whole analysis run properly (we expect 22 chromosomes)?





> Question: Why do the number of variants per chrosome (approximately) correlate with the chromosome number?

> Question: Where are the data for chromosome X, Y and MT?

Let's plot the QQ plot to diagnose our GWAS. 


<div class="figure" style="text-align: center">
<img src="img/_gwas_dummy/show-qq.png" alt="A QQ plot." width="85%" />
<p class="caption">(\#fig:show-qq)A QQ plot.</p>
</div>

## Manhattan plots

We also need to create a _Manhattan plot_ to display the association test P-values as a function of chromosomal location and thus provide a visual summary of association test results that draw immediate attention to any regions of significance (Figure \@ref(fig:showmanhattan)).



<div class="figure" style="text-align: center">
<img src="img/_gwas_dummy/show-manhattan.png" alt="A manhattan plot." width="85%" />
<p class="caption">(\#fig:showmanhattan)A manhattan plot.</p>
</div>

## Other plots

It is also informative to plot the density per chromosome. We can use the `CMplot` for that which you can find [here](https://github.com/YinLiLin/R-CMplot){target="_blank"}. For now we just make these graphs 'quick-n-dirty', you can further prettify them, but you easily loose track of time, so maybe carry on.






> Question: What do the grey spots on the density plot indicate?

This would lead to the following graphs. 
<div class="figure" style="text-align: center">
<img src="img/_gwas_dummy/show-cmplot-all-density.png" alt="SNP density of the association results." width="85%" />
<p class="caption">(\#fig:showcmplotalldensity)SNP density of the association results.</p>
</div>

<div class="figure" style="text-align: center">
<img src="img/_gwas_dummy/show-cmplot-all-qq.png" alt="A QQ plot including a 95% confidence interval (blue area) and genome-wide significant hits (red)." width="85%" />
<p class="caption">(\#fig:showcmplotallqq)A QQ plot including a 95% confidence interval (blue area) and genome-wide significant hits (red).</p>
</div>

<div class="figure" style="text-align: center">
<img src="img/_gwas_dummy/show-cmplot-all-manhattan.png" alt="A regular manhattan plot. Colored by chromosome, suggestive hits are green, genome-wide hits are red. The bottom graph shows the per-chromosome SNP density." width="85%" />
<p class="caption">(\#fig:showcmplotallmanhattan)A regular manhattan plot. Colored by chromosome, suggestive hits are green, genome-wide hits are red. The bottom graph shows the per-chromosome SNP density.</p>
</div>

<div class="figure" style="text-align: center">
<img src="img/_gwas_dummy/show-cmplot-all-circular.png" alt="A circular manhattan." width="85%" />
<p class="caption">(\#fig:show-cmplot-all-circular)A circular manhattan.</p>
</div>

## Interactive plots

You can also make an [interactive version](https://r-graph-gallery.com/101_Manhattan_plot.html){target="_blank"} of the Manhattan - just because you can. The code below shows you how.


```
library(plotly)
library(dplyr)

# Prepare the dataset (as an example we use the data (gwasResults) from the `qqman`-package)
don <- gwasResults %>%

  # Compute chromosome size
  group_by(CHR) %>%
  summarise(chr_len=max(BP)) %>%

  # Calculate cumulative position of each chromosome
  mutate(tot=cumsum(chr_len)-chr_len) %>%
  select(-chr_len) %>%

  # Add this info to the initial dataset
  left_join(gwasResults, ., by=c("CHR"="CHR")) %>%

  # Add a cumulative position of each SNP
  arrange(CHR, BP) %>%
  mutate( BPcum=BP+tot) %>%

  # Add highlight and annotation information
  mutate( is_highlight=ifelse(SNP %in% snpsOfInterest, "yes", "no")) %>%

  # Filter SNP to make the plot lighter
  filter(-log10(P)>0.5)

# Prepare X axis
axisdf <- don %>% group_by(CHR) %>% summarize(center=( max(BPcum) + min(BPcum) ) / 2 )

# Prepare text description for each SNP:
don$text <- paste("SNP: ", don$SNP, "\nPosition: ", don$BP, "\nChromosome: ", don$CHR, "\nLOD score:", -log10(don$P) %>% round(2), "\nWhat else do you wanna know", sep="")

# Make the plot
p <- ggplot(don, aes(x=BPcum, y=-log10(P), text=text)) +

    # Show all points
    geom_point( aes(color=as.factor(CHR)), alpha=0.8, size=1.3) +
    scale_color_manual(values = rep(c("grey", "skyblue"), 22 )) +

    # custom X axis:
    scale_x_continuous( label = axisdf$CHR, breaks= axisdf$center ) +
    scale_y_continuous(expand = c(0, 0)) +     # remove space between plot area and x axis

    # Add highlighted points
    geom_point(data=subset(don, is_highlight=="yes"), color="orange", size=2) +

    # Custom the theme:
    theme_bw() +
    theme(
      legend.position="none",
      panel.border = element_blank(),
      panel.grid.major.x = element_blank(),
      panel.grid.minor.x = element_blank()
    )
ggplotly(p, tooltip="text")
```


It will produce something like this.

![](img/_gwas/interactive_plot.png)


Again, this is an example with dummy data - you can try to do it for our GWAS, but careful with the time. You can also choose to carry on.

You will encounter the above types of visualizations in any high-quality GWAS paper, because each is so critically informative. Usually, analysts of large-scale meta-analyses of GWAS will also stratify the QQ-plots based on the imputation quality (if your GWAS was imputed), call rate, and allele frequency (although that is rarely shared in publications, not even in supplemental material).


## Stop playing around

Alright. It's time to stop playing around and do a quick recap of what you've learned.

1. You learned how to convert datasets.
2. You learned how to execute sample QC and create diagnostic graphics
3. You learned how to do the same for SNP QC
4. You learned how to execute an association study given a dataset, covariates, and different assumptions regarding the genetic model.
5. You learned how to visualize results and played around with different visuals. 

You should be ready for the real stuff. And if not, the next chapter will help you get ready: Chapter \@ref(wtccc1_intro).

<!-- ```{js, echo = FALSE} -->
<!-- title=document.getElementById('header'); -->
<!-- title.innerHTML = '<img src="img/_headers/interactive_plot.png" alt="GWAS basics: visualization">' + title.innerHTML -->
<!-- ``` -->

<!--chapter:end:03_5_gwas_basics_results_visualization.Rmd-->

# Licenses and disclaimers {#license}
<!-- ![](img/_headers/licenses.png){width=100%} -->





## Copyright

This book and all its material ("content") is protected by copyright under Dutch Copyright laws and is the property of the author or the party credited as the provider of the content. You may not copy, reproduce, distribute, publish, display, perform, modify, create derivative works, transmit, or in any way exploit any such content, nor may you distribute any part of this content over any network, including a local area network, sell or offer it for sale, or use such content to construct any kind of database. You may not alter or remove any copyright or other notice from copies of the content on this website. Copying or storing any content except as provided above is expressly prohibited without prior written permission of the author or the copyright holder identified in the individual content's copyright notice. For permission to use this content, please contact the author.

## Disclaimer

The content contained herein is provided only for educational and informational purposes or as required by Dutch law. The author attempted to ensure that content is accurate and obtained from reliable sources, but does not represent it to be error-free. The author may add, amend or repeal any text, procedure or regulation, and failure to timely post such changes to this book shall not be construed as a waiver of enforcement. The author does not warrant that any functions on this website or the contents and references herein will be uninterrupted, that defects will be corrected, or that this website or the contents and references will be free from viruses or other harmful components. Any links to third party information on the author’s website are provided as a courtesy and do not constitute an endorsement of those materials or the third party providing them.

## Images and data used

I took the at-most care to use refer to the original works and data sources where needed. Likewise, all the images c.q. figures are either produced specifically for this book, or I took them from [**Unsplash**](https://unsplash.com/s/photos/legal) to brighten up the book. If you feel I made a mistake and your work should be properly referenced, please don't hesitate to contact me. 

These are the images from **Unsplash** listed here in no particular order.

- papers_on_wall - https://unsplash.com/photos/open-book-lot-Oaqk7qqNh_c
- women_behind_macbook - https://unsplash.com/photos/woman-using-macbook-pro-with-person-in-white-top-bPVM4nOy0Rg
- woman_working_on_code - https://unsplash.com/photos/woman-wearing-black-t-shirt-holding-white-computer-keyboard-YK0HPwWDJ1I
- licenses - https://unsplash.com/photos/book-lot-on-black-wooden-shelf-zeH-ljawHtg

## Copyright

Copyright 1979-2024. All rights reserved. Sander W. van der Laan | [s.w.vanderlaan [at] gmail.com](mailto:s.w.vanderlaan@gmail.com) | [https://vanderlaanand.science](https://vanderlaanand.science){target="_blank"}. Published with [`bookdown`](https://bookdown.org/yihui/bookdown/){target="_blank"}.

<!-- ```{js, echo = FALSE} -->
<!-- title=document.getElementById('header'); -->
<!-- title.innerHTML = '<img src="img/_headers/banner_man_standing_dna.png" alt="Licenses">' + title.innerHTML -->
<!-- ``` -->

<!--chapter:end:11_licenses.Rmd-->

# Colophon





The 2022 and 2024 editions of this book were produce in RStudio and with the `bookdown` package. Below a listing of installed programs and libraries, the operating system, and their specific versions.


```
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.3.3 (2024-02-29)
##  os       macOS Sonoma 14.5
##  system   x86_64, darwin20
##  ui       X11
##  language (EN)
##  collate  en_US.UTF-8
##  ctype    en_US.UTF-8
##  tz       America/New_York
##  date     2024-04-04
##  pandoc   3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package           * version date (UTC) lib source
##  askpass             1.2.0   2023-09-03 [2] CRAN (R 4.3.0)
##  bookdown          * 0.38.1  2024-03-26 [2] Github (rstudio/bookdown@50a1c1e)
##  bslib               0.6.2   2024-03-22 [2] CRAN (R 4.3.2)
##  cachem              1.0.8   2023-05-01 [2] CRAN (R 4.3.0)
##  chromote            0.2.0   2024-02-12 [1] CRAN (R 4.3.2)
##  cli                 3.6.2   2023-12-11 [2] CRAN (R 4.3.0)
##  colorspace          2.1-0   2023-01-23 [2] CRAN (R 4.3.0)
##  crayon              1.5.2   2022-09-29 [2] CRAN (R 4.3.0)
##  crul                1.4.0   2023-05-17 [2] CRAN (R 4.3.0)
##  curl                5.2.1   2024-03-01 [2] CRAN (R 4.3.2)
##  data.table          1.15.4  2024-03-30 [1] CRAN (R 4.3.2)
##  digest              0.6.35  2024-03-11 [2] CRAN (R 4.3.2)
##  evaluate            0.23    2023-11-01 [2] CRAN (R 4.3.0)
##  fastmap             1.1.1   2023-02-24 [2] CRAN (R 4.3.0)
##  flextable         * 0.9.5   2024-03-06 [1] CRAN (R 4.3.2)
##  fontBitstreamVera   0.1.1   2017-02-01 [2] CRAN (R 4.3.0)
##  fontLiberation      0.1.0   2016-10-15 [2] CRAN (R 4.3.0)
##  fontquiver          0.2.1   2017-02-01 [2] CRAN (R 4.3.0)
##  formatR           * 1.14    2023-01-17 [2] CRAN (R 4.3.0)
##  gdtools             0.3.7   2024-03-05 [2] CRAN (R 4.3.2)
##  gfonts              0.2.0   2023-01-08 [2] CRAN (R 4.3.0)
##  glue                1.7.0   2024-01-09 [2] CRAN (R 4.3.0)
##  htmltools           0.5.8   2024-03-25 [2] CRAN (R 4.3.2)
##  httpcode            0.3.0   2020-04-10 [2] CRAN (R 4.3.0)
##  httpuv              1.6.15  2024-03-26 [2] CRAN (R 4.3.2)
##  jquerylib           0.1.4   2021-04-26 [2] CRAN (R 4.3.0)
##  jsonlite            1.8.8   2023-12-04 [2] CRAN (R 4.3.0)
##  kableExtra        * 1.4.0   2024-01-24 [1] CRAN (R 4.3.2)
##  knitr             * 1.45    2023-10-30 [1] CRAN (R 4.3.0)
##  later               1.3.2   2023-12-06 [2] CRAN (R 4.3.0)
##  lifecycle           1.0.4   2023-11-07 [2] CRAN (R 4.3.0)
##  magrittr            2.0.3   2022-03-30 [2] CRAN (R 4.3.0)
##  mime                0.12    2021-09-28 [2] CRAN (R 4.3.0)
##  munsell             0.5.1   2024-04-01 [1] CRAN (R 4.3.2)
##  officer             0.6.5   2024-02-24 [2] CRAN (R 4.3.2)
##  openssl             2.1.1   2023-09-25 [2] CRAN (R 4.3.0)
##  processx            3.8.4   2024-03-16 [2] CRAN (R 4.3.2)
##  promises            1.2.1   2023-08-10 [2] CRAN (R 4.3.0)
##  ps                  1.7.6   2024-01-18 [2] CRAN (R 4.3.0)
##  R6                  2.5.1   2021-08-19 [2] CRAN (R 4.3.0)
##  ragg                1.3.0   2024-03-13 [2] CRAN (R 4.3.2)
##  Rcpp                1.0.12  2024-01-09 [2] CRAN (R 4.3.0)
##  rlang               1.1.3   2024-01-10 [2] CRAN (R 4.3.0)
##  rmarkdown         * 2.26.1  2024-03-26 [2] Github (rstudio/rmarkdown@ee69d59)
##  rstudioapi          0.16.0  2024-03-24 [2] CRAN (R 4.3.2)
##  sass                0.4.9   2024-03-15 [2] CRAN (R 4.3.2)
##  scales              1.3.0   2023-11-28 [2] CRAN (R 4.3.0)
##  sessioninfo         1.2.2   2021-12-06 [2] CRAN (R 4.3.0)
##  shiny               1.8.1   2024-03-26 [2] CRAN (R 4.3.2)
##  stringi             1.8.3   2023-12-11 [2] CRAN (R 4.3.0)
##  stringr             1.5.1   2023-11-14 [2] CRAN (R 4.3.0)
##  svglite             2.1.3   2023-12-08 [1] CRAN (R 4.3.0)
##  systemfonts         1.0.6   2024-03-07 [2] CRAN (R 4.3.2)
##  textshaping         0.3.7   2023-10-09 [2] CRAN (R 4.3.0)
##  tinytex           * 0.50    2024-03-16 [2] CRAN (R 4.3.2)
##  uuid                1.2-0   2024-01-14 [2] CRAN (R 4.3.0)
##  viridisLite         0.4.2   2023-05-02 [2] CRAN (R 4.3.0)
##  webshot           * 0.5.5   2023-06-26 [1] CRAN (R 4.3.0)
##  webshot2          * 0.1.1   2023-08-11 [1] CRAN (R 4.3.0)
##  websocket           1.4.1   2021-08-18 [1] CRAN (R 4.3.0)
##  xfun                0.43    2024-03-25 [2] CRAN (R 4.3.2)
##  xml2                1.3.6   2023-12-04 [2] CRAN (R 4.3.0)
##  xtable              1.8-4   2019-04-21 [2] CRAN (R 4.3.0)
##  yaml                2.3.8   2023-12-11 [2] CRAN (R 4.3.0)
##  zip                 2.3.1   2024-01-27 [2] CRAN (R 4.3.2)
## 
##  [1] /Users/slaan3/Library/R/x86_64/4.3/library
##  [2] /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/library
## 
## ──────────────────────────────────────────────────────────────────────────────
```

<!-- ```{js, echo = FALSE} -->
<!-- title=document.getElementById('header'); -->
<!-- title.innerHTML = '<img src="img/_headers/banner_man_standing_dna.png" alt="Colofon">' + title.innerHTML -->
<!-- ``` -->

<!-- example: https://yearbookdiscoveries.com/wp-content/uploads/2014/05/Writing_a_Yearbook_Colophon.pdf -->
<!-- SPECIAL THANKS: (The staff wrote a few paragraphs about the year and mentioned a variety -->
<!-- of people who were instrumental in the success of their yearbook.) -->
<!-- COVER & ENDSHEETS: The 2013 Pinnacle cover is a four-color lithograph. An iridescent foil -->
<!-- covers a portion of the theme design. The endsheets are standard stock paper. The theme -->
<!-- concept was created and expanded by the editorial team and members of the 2013 Pinnacle -->
<!-- staff. Cover and endsheets were designed by Pinnacle co-editors-in-chief Regan Brown and -->
<!-- Ellena Sullivan, with inspiration provided by an early design from co-reference editor -->
<!-- Amanda Farrer. -->
<!-- TYPE & COLOR TREATMENT: Body copy throughout the book is set in Frutiger Light -->
<!-- Condensed (8.5 pt.) Captions are set in Frutiger Light Condensed (7.5 pt.) Headline -->
<!-- treatments are designed with variations of AHJ Nashville, Arno Pro and Frutiger. Photo -->
<!-- credits and spread credits appear in Frutiger Italic (6 pt.) -->
<!-- For consistency, a color palette was chosen. In addition to the traditional black, the -->
<!-- following colors appear throughout the publication: Pantone 151C, Pantone 3005C, Panton -->
<!-- 2985C, Pantone 115C, Pantone 376C, Pantone 363C, Pantone 266C, Pantone 185C and -->
<!-- Pantone Cool Grey 7C. -->
<!-- PUBLISHING: Volume 107 of the Pinnacle was designed and produced by the 2013 Pinnacle -->
<!-- staff. The 456-page, all-color Pinnacle is printed on 80 lb. gloss paper by Herff Jones -->
<!-- Publishing Co. in Kansas City, MO. Approximately 3,000 copies were pre-ordered for $52. -->
<!-- Any extra copies were sold for $60. A 48-page supplement was included in this price. The -->
<!-- publication was created using Adobe CS5.5 software on 42 Macintosh desktop and laptop -->
<!-- computers. -->
<!-- PHOTOGRAPHY: Pinnacle staff photographers shot digital photos using four Nikon D70s, -->
<!-- two Nikon D80s and one Nikon D40. Sport group photos were shot by Prestige Portraits, -->
<!-- and club and group photos were shot by both Prestige Portraits and Pinnacle yearbook staff -->
<!-- photographers. Some submitted photos appear throughout the book as well. -->
<!-- EDITORS’ NOTE: (A special note from the co-editors-in-chief was included here. The yearbook -->
<!-- staff photo with names and staff positions was included on the spread with the colophon.) -->

<!--chapter:end:12_colophon.Rmd-->


# References {-}


<!--chapter:end:13_references.Rmd-->

